Abstract


A graph-theoretic design process and software tool are defined
for selecting a multiprocessor scheduling solution for a class
of computational problems. The problems of interest are those
that can be described with a dataflow graph and are intended to
be executed repetitively on a set of identical processors.
Typical applications include signal processing and control law
problems. Graph-search algorithms and analysis techniques are
introduced and shown to effectively determine performance
bounds, scheduling constraints, and resource requirements. The
software tool applies the design process to a given problem and
includes performance optimization through the inclusion of
additional precedence constraints among the schedulable tasks.


1. Introduction 

This paper describes methods capable of determin- 
ing and evaluating the steady-state behavior of a class of 
computational problems for iterative parallel execution 
on multiple processors. The computational problems 
must be capable of being described by a directed graph. 
When the directed graph is a result of inherent data
dependencies within the problem, the directed graph is
often referred to as a “dataflow graph.” Dataflow graphs,
generalized models of computation, have received
increased attention for use in modeling parallelism inherent
in computational problems (refs. 1 through 3). This
attention can be attributed not only to the ease with which
dataflow graphs can model parallelism but also to their
amenability to direct interpretation of program flow and
behavior (ref. 4).

In this paper, graph nodes represent schedulable 
tasks and graph edges represent the data dependencies 
between the tasks. Because the data dependencies imply
a precedence relationship, the tasks make up a
partially ordered set; that is, some tasks must execute in a
particular order, whereas other tasks may execute
independently of one another. When a computational problem or
algorithm can be described with a dataflow graph, the 
inherent parallelism present in the algorithm can be 
readily observed and exploited. The modeling methods 
presented in this paper are applicable to a class of data- 
flow graphs where the time to execute tasks is assumed 
constant from iteration to iteration when executed on a 
set of identical processors. Also, the dataflow graph is 
assumed to be data independent; that is, any decisions 
present within the computational problem are contained 
within the graph nodes rather than described at the graph 
level. The dataflow graph provides both a graphical and 
mathematical model capable of determining run-time 
behavior and resource requirements at compile time. In 
particular, dataflow graph analysis is shown to be able to 
determine the exploitable parallelism, theoretical perfor- 
mance bounds, speedup, and resource requirements of 
the system. Because the graph edges imply data storage,
the resource requirement specifies the minimum amount
of memory needed for data buffers as well as the processor
requirements. Obtaining this information is useful in
allowing a user to match the resource requirements with
resource availability. In addition, the nonpreemptive
scheduling and synchronization of the tasks that are
sufficient to obtain the theoretical performance are specified
by the dataflow graph. This property allows the user to
direct the run-time execution according to the dataflow 
firing rules (i.e., when tasks are enabled for execution) so 
that the run-time effort is reduced to simply allocating an 
idle processor to an enabled task (refs. 5 and 6). When 
resource availability is not sufficient to achieve optimum 
performance, a technique of optimizing the dataflow 
graph with artificial data dependencies, called control 
edges, is discussed. 

Predicting the computing performance, resource 
requirements, and processor utilization connected with 
the execution of a dataflow graph requires the determina- 
tion of steady-state behavior. Dataflow graph analysis 
algorithms and rules are defined in this paper for deter- 
mining the scheduling constraints, that is, earliest execu- 
tion times and mobility, for all tasks under steady-state 
conditions. It is also shown that certain initial conditions 
represented by initial data in a dataflow graph may result 
in a transient-state execution different from the 
steady-state execution. The analysis algorithms are 
shown to detect such transient conditions. The method 
for determining periodic steady-state behavior is based 
on first describing the execution of data associated with a 
single computational iteration, referred to as a “data set.” 
Second, the transient state is distinguished from the 
steady state if necessary when initial data are present. 
Finally, the periodic execution for multiple iterations is 
determined from the steady-state single iteration 
description. 

For the mathematical models presented, an efficient 
software tool which applies the models is desirable for 
solving problems in a timely manner. A software tool 
developed for design and analysis is presented. The software
program, referred to hereafter as the “Design Tool,”
provides automatic and interactive analysis capabilities
applicable to the design of a multiprocessing solution.
The development of the Design Tool was motivated by a 
need to adapt multiprocessing computations to emerging 
very-high-speed integrated circuit (VHSIC) space- 
qualified hardware for aerospace applications. In addi- 
tion to the Design Tool, a multiprocessing operating sys- 
tem based on a directed-graph approach called the 
ATAMM multicomputer operating system (AMOS) was 
developed. AMOS executes the rules of the
algorithm-to-architecture mapping model (ATAMM) and has been
successfully demonstrated on a generic VHSIC spaceborne
computer (GVSC) consisting of four processors
loosely coupled on a parallel-interface (PI) bus (refs. 5
and 6). The Design Tool was developed not only for the
AMOS/GVSC application-development environment
presented in references 5 and 7 but also for other potential
dataflow applications. For example, the design procedures
based on ATAMM solve signal processing problems
addressed by Parhi and Messerschmitt in
reference 3. (See ref. 8.) Information provided by the
Design Tool could also be used as scheduling constraints,
as done in reference 9, to aid other scheduling algorithms.

The modeling of a computational problem with a 
dataflow graph and analysis diagrams is discussed in 
section 2. A forward-search algorithm is defined and is 
shown to determine the earliest execution times for all 
tasks. Section 3 discusses a modification to the dataflow 
graph described in section 2, which lends itself to 
the modeling of initial conditions. In addition, a 
backward-search algorithm is defined and shown to 
determine the mobility of the tasks and transient condi- 
tions which affect the steady-state behavior. The perfor- 
mance metrics and resource requirements procedures 
implemented in the Design Tool are described in 
section 4. The memory requirements of data shared
among tasks, as described by a directed graph, are shown
to be bounded. Rules for determining the minimum
memory requirements for buffering shared data are
defined. The Design Tool displays and features are
presented in section 5, where the performance results are
compared with the theoretical results derived in the
previous sections. Section 5 also presents execution time
results regarding the Design Tool implementation of the 
algorithms presented in sections 2 and 3. Applications 
and future research are summarized in section 6. 

2. Dataflow Graphs and Scheduling Diagrams 

A generalized description of a multiprocessing prob- 
lem and how it can be modeled by a directed graph is 
presented in this section. Such formalism is useful in 
defining the graph analysis algorithms and rules which 
determine scheduling constraints. A computational problem
(job) can often be decomposed into a set of tasks to
be scheduled for execution (ref. 10). If the tasks are
not independent of one another, a precedence relationship
is imposed on the tasks in order to obtain correct
computational results. A task system can be represented
formally as a 4-tuple $(T, \prec, L, M_0)$ where

$T$        set of $n$ tasks to be executed, $\{T_1, T_2, T_3, \ldots, T_n\}$

$\prec$    precedence relationship on $T$ such that $T_i \prec T_j$
           signifies that $T_j$ cannot execute until completion
           of $T_i$

$L$        nonempty, strictly positive set of run-time latencies
           such that task $T_i$ takes $L_i$ amount of time to
           execute, $\{L_1, L_2, L_3, \ldots, L_n\}$

$M_0$      initial state of system, as indicated by presence
           of initial data

Such task systems can be described by a directed 
graph where nodes (vertices) represent the tasks and 
edges (arcs) describe the precedence relationship 
between the tasks. When the precedence constraints 
given by $\prec$ are a result of the dataflow between the
tasks, the directed graph is referred to as a “dataflow 
graph (DFG)” as shown in figure 1. Special transitions 
called sources and sinks are also provided to model the 
input and output data streams of the task system. The 
presence of data is indicated within the DFG by the 
placement of tokens. The DFG is initially in the state 
indicated by the marking M 0 . The graph moves through 
other markings as a result of a sequence of node firings 
(executions); that is, when a token is available on every 
input edge of a node and sufficient resources are avail- 
able for the execution of the task represented by the 
node, the node fires. When the node associated with task 
$T_i$ fires, it consumes one token from each of its input
edges, delays an amount of time equal to $L_i$, and then
deposits one token on each of its output edges. Sources
and sinks have special firing rules; sources are
unconditionally enabled for firing, and sinks consume tokens but
do not produce any. By analyzing the DFG in terms of its
critical path, critical circuit, dataflow schedule, and the 
token bounds within the graph, the performance charac- 
teristics and resource requirements can be determined 
a priori. The Design Tool depends on this dataflow repre- 
sentation of a task system, and the graph-theoretic per- 
formance metrics presented herein. 
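
The firing rule described above can be illustrated with a short Python sketch. It is not taken from the Design Tool or AMOS; the names Node, enabled, and fire, and the representation of a marking as a dictionary of token counts per edge, are assumptions made for illustration only. Resource availability is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node record: a task with a latency and named edges."""
    name: str
    latency: int
    inputs: list = field(default_factory=list)   # names of input edges
    outputs: list = field(default_factory=list)  # names of output edges


def enabled(node, marking):
    """A node is enabled when every input edge holds at least one token.

    A source (no input edges) is therefore unconditionally enabled.
    """
    return all(marking[edge] > 0 for edge in node.inputs)


def fire(node, marking, now):
    """Fire an enabled node at time `now`.

    Returns (finish_time, new_marking): one token is consumed from each
    input edge and, after the node latency, one token is deposited on
    each output edge. A sink (no output edges) produces no tokens.
    """
    if not enabled(node, marking):
        raise ValueError(f"node {node.name} is not enabled")
    new_marking = dict(marking)
    for edge in node.inputs:
        new_marking[edge] -= 1
    for edge in node.outputs:
        new_marking[edge] += 1
    return now + node.latency, new_marking


# Hypothetical example: task A needs tokens on e1 and e2 and writes to e3.
task_a = Node("A", 100, inputs=["e1", "e2"], outputs=["e3"])
start_marking = {"e1": 1, "e2": 1, "e3": 0}
finish_time, end_marking = fire(task_a, start_marking, now=50)
```

The marking is copied rather than modified in place so that successive markings of the graph can be compared when stepping through a sequence of firings.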

The graph execution for a single iteration, unlimited 
resources assumed, can be portrayed with a Gantt chart 
where horizontal bars are used to indicate when tasks 
may be scheduled for execution. Such a chart is referred 
to hereafter as a “single graph play (SGP) diagram,” 
which is shown in figure 2 for the DFG of figure 1. The
SGP can be constructed by calculating the earliest start 
(ES) times for all tasks. The ES times can be calculated 
by envisioning the migration of a single data set through 
the graph. Since the condition for a node to fire (begin 
execution) is having a token present on all its inputs, the 
ES time for a given task is equal to the longest path 
latency (starting from the source) for all paths leading to 
its inputs. The longest input path latency would indicate 
the time at which all input tokens would be present for 
execution. The amount of time required for all nodes of a
graph to execute a single data set or graph iteration is
referred to as the schedule length, denoted as $\omega$. For
generality, the task latencies shown in figure 1 are given in
clock units, and therefore the schedule length is
shown in figure 2 to be equal to 600 clock units.

Figure 2. Single graph play diagram, $\omega$ = 600 clock units.

The two algorithms, defined in this paper, that 
implement a forward and backward search of the directed 
graph and other analyses are based on a linked-list repre- 
sentation of the graph. In this way, pointers can be used 
for efficient progression through the graph from any 
given starting point. An example illustrating the connec- 
tions between node objects and edge objects is shown in 
figure 3. The object address pointers are denoted by
asterisks. A node object points to just one input and
one output. All other inputs and outputs are connected
to the node by the next_input and next_output
pointers. A null pointer indicates that no other input or
output exists.

Given a linked-list graph representation as shown in 
figure 3, the following forward-search algorithm deter- 
mines the earliest start times for all nodes (tasks). The 
algorithm employs the depth-first searching method 
where the graph is penetrated as deeply as possible from 
a given source before fanning out to other nodes. For 
each node encountered in the search, the algorithm calls 
the procedure SearchFwd recursively for each output 
edge associated with the node. The recursive nature of 
the algorithm allows a depth-first search of the graph to 
be done while implicitly retaining the next edge (starting 
point for the next path to traverse when fanning out) and 
accumulated path latency on the memory stack. The 
arguments passed into SearchFwd are an address 
pointer (edge) to an edge structure (fig. 3) and the cur- 
rent path latency (path_latency) up to the edge. 
Also, let node specify a pointer to a node structure. An 
edge will point to a next_output if present, and will 
be null if no other output edges for the current node 
exist. The ES Algorithm is stated as follows: 

A. Initialize earliest start times for all nodes to
   zero.

B. Execute procedure SearchFwd (source.output, 0)
   for every source in graph by starting with first
   output edge of source; path latency, the second
   parameter, is initially set to zero.

SearchFwd (edge, path_latency)

   1. If edge.next_output is not null,
      call SearchFwd (edge.next_output,
      path_latency).

   2. Get the node that uses this edge for
      input by setting node equal to
      edge.terminal_node.

   3. Determine the earliest start of node,
      ES (node), such that ES (node) = max
      [ES (node), path_latency].

   4. Increase path_latency by the node
      latency, $L_{node}$.

   5. Set edge equal to the first output edge of
      node, edge = node.output.

   6. If a sink has been reached (edge = null),
      return from this procedure; else repeat
      Step 1.
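
The steps of the ES Algorithm can be sketched in Python as follows. This is an illustrative sketch, not the Design Tool implementation: the Node and Edge classes mirror only the pointer fields named in the text (output, next_output, terminal_node), the helper connect and the six-node example graph are hypothetical, and the latencies are not those of figure 1.

```python
class Node:
    """Graph node (task); `output` is the first edge of its output list."""

    def __init__(self, name, latency):
        self.name = name
        self.latency = latency
        self.output = None   # first output edge, None for a sink
        self.es = 0          # Step A: earliest start initialized to zero


class Edge:
    """Graph edge; `next_output` links the other outputs of the same node."""

    def __init__(self, terminal_node):
        self.terminal_node = terminal_node
        self.next_output = None


def connect(src, dst):
    """Hypothetical helper: prepend an edge src -> dst to src's output list."""
    edge = Edge(dst)
    edge.next_output = src.output
    src.output = edge


def search_fwd(edge, path_latency):
    """Depth-first forward search following Steps 1 through 6."""
    while edge is not None:                       # Step 6: stop at a sink
        if edge.next_output is not None:          # Step 1: fan out
            search_fwd(edge.next_output, path_latency)
        node = edge.terminal_node                 # Step 2
        node.es = max(node.es, path_latency)      # Step 3
        path_latency += node.latency              # Step 4
        edge = node.output                        # Step 5


# Hypothetical graph: S -> A -> {B, C} -> D -> K, with S a source, K a sink.
nodes = {name: Node(name, latency) for name, latency in
         [("S", 0), ("A", 100), ("B", 200), ("C", 50), ("D", 100), ("K", 0)]}
for src, dst in [("S", "A"), ("A", "B"), ("A", "C"),
                 ("B", "D"), ("C", "D"), ("D", "K")]:
    connect(nodes[src], nodes[dst])

search_fwd(nodes["S"].output, 0)                  # Step B
es = {name: node.es for name, node in nodes.items()}
```

In this example the earliest start of the sink K equals the longest source-to-sink path latency, which corresponds to the schedule length of the hypothetical graph. The recursion in Step 1 keeps the pending output edge and the accumulated latency on the call stack, as the text describes.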
(b) Linked-list representation.

Figure 3. Linked-list storage of dataflow graph.
The ES Algorithm execution time is graph dependent
and is bounded by

$$\text{Bound} = \sum_{\text{over all paths in DFG}} N_i \qquad (1)$$

where $N_i$ is the number of nodes in a given path. Because
the number of paths in a given graph with at most
$N$ nodes is bounded by $N^2$, the expression (eq. (1)) has a
worst-case bound of $N^3$. Therefore, the ES Algorithm
has a polynomial-time complexity of the order of $N^3$, or
$O(N^3)$.

The elapsed time between the production of an input 
token by the source and the consumption of the corre- 
sponding output token by the sink is defined as the time 
between input and output (TBIO). When initial tokens
are not present, $\omega$ will be equal to TBIO; otherwise $\omega$
may be greater than TBIO. As discussed later, the SGP
determined by the ES Algorithm when initial tokens in
the forward dataflow direction are present may not be
representative of the steady-state behavior, SGP$_{s-s}$, at
run time but instead portrays a transient state, SGP$_{t-s}$.
Refinements to the computed earliest start times may be
required to obtain the SGP$_{s-s}$. A method for determining
these refinements is included in the next section.


Of particular interest are the cases when the algo- 
rithm modeled by the DFG is executed repetitively for 
different data sets. The iteration period and, thus, 
throughput is characterized by the metric TBO (time 
between outputs) where TBO is defined as the time 
between consecutive consumptions of output tokens by a 
sink. It can be shown that because of the consistency 
property of dataflow graphs, all tasks execute with period 
TBO (refs. 11 and 12). This implies that if input data are
injected into the graph with period TBI (time between 
inputs) then output data will be generated at the graph 
sink with period TBO equal to TBI. 

The periodic graph execution for multiple iterations 
can be portrayed in another Gantt chart referred to as a 
“total graph play (TGP) diagram.” The TGP diagram 
shows the execution over a single iteration period of 
TBO. Like the single graph play diagram, the total graph 
play diagram represents task executions with horizontal 
bars. The TGP can be constructed from the SGP by 
dividing the SGP into segments of width TBO starting 
from the left of the diagram. The resulting SGP from the 
previous example for an arbitrarily selected TBO period 
of 333 clock units is shown in figure 4. Each segment is 
representative of the execution associated with a particu- 
lar data set when the graph is executed periodically. 



Figure 4. Segmented single graph play diagram (time axis in clock units).

Consequently, these segments are assigned relative data
set numbers, 1 to $\mathcal{D}$, from right to left. Overlapping these
segments portrays the graph execution for multiple data 
sets within a TBO period as shown in figure 5. Note that 
the relative data set numbers assigned to the task bars 
within the TGP of figure 5 correspond to the numbered 
SGP segments of figure 4. The fact that within a TBO 
period, every task will execute exactly once is obvious 
from the nature of how the TGP is constructed by over- 
lapping TBO-width segments from the SGP. The total 
computing effort (TCE) within a TBO interval from SGP 
segments would therefore equal the sum of all task latencies
within the latency set $L$.

Figure 5. Total graph play diagram. TBO = 333 clock units.


Constructing the TGP by overlapping SGP segments 
is equivalent to mapping the ES times (relative to the 
SGP) to a time interval of width TBO by using the map- 
ping function ES modulo TBO. The number of SGP seg- 
ments is equal to the maximum number of data sets 
simultaneously present in the graph at steady state and 
indicates the level of pipeline concurrency that is being 
exploited. This metric is given by applying the ceiling
function$^1$ to the ratio of the schedule length $\omega$ to TBO as
shown in the following equation:

$$\mathcal{D} = \left\lceil \frac{\omega}{\text{TBO}} \right\rceil \qquad (2)$$

$^1$The ceiling of a real number $x$, denoted as $\lceil x \rceil$, is equal to the
smallest integer greater than or equal to $x$.


By numbering the SGP segments 1 to $\mathcal{D}$ from right to left,
a relative data set numbered $D$ will refer to a data set
injected into the graph 1 TBO interval after a data set
numbered $D - 1$. Overlapped bars for a given task
indicate that the task has multiple instantiations, as for task B.
That is, the task is executed on different processors 
simultaneously for different data sets. Allowing multiple 
task instantiations is a key mechanism for increasing 
speedup. 
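
As a rough illustration of the SGP-to-TGP construction, the following Python sketch maps earliest start times into a TBO-wide interval with ES modulo TBO and computes the number of SGP segments from equation (2). The function names and the task start times are hypothetical; only the values 600 and 333 clock units come from the example in the text. The sketch places task start times only and does not split bars that cross a segment boundary.

```python
def segment_count(schedule_length, tbo):
    """Equation (2): ceiling of schedule length over TBO, in integer arithmetic."""
    return -(-schedule_length // tbo)


def tgp_placement(es_times, schedule_length, tbo):
    """Map each ES time to (start within the TBO interval, relative data set).

    SGP segments are counted from the left starting at zero, then renumbered
    from right to left so that the rightmost segment is data set 1.
    """
    segments = segment_count(schedule_length, tbo)
    placement = {}
    for task, start in es_times.items():
        segment_from_left = start // tbo
        placement[task] = (start % tbo, segments - segment_from_left)
    return placement


# Hypothetical ES times for three tasks; 600 and 333 follow the text's example.
example_es = {"A": 0, "B": 100, "F": 500}
placement = tgp_placement(example_es, schedule_length=600, tbo=333)
```

With these values two SGP segments result, so at most two data sets are present in the graph at once; a task starting at 500 clock units falls in the rightmost segment and is therefore drawn at 167 clock units in the TGP with relative data set number 1.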

The inherent nature of dataflow graphs is to accept 
data as quickly as the graph and available resources (pro- 
cessors and memory) allow. When this occurs, the graph 
becomes congested with tokens waiting on edges for pro- 
cessing because of the finite resources available, without 
resulting in an increase in throughput above the 
graph-imposed upper bound (refs. 2 and 13). When
tokens wait on the critical path for execution, however, 
an increase in TBIO above the lower bound occurs. This 
increase in TBIO can be undesirable for many real-time 
applications. It is therefore necessary to constrain the 
parallelism that can be exploited in order to prevent 
resource saturation. The parallelism in dataflow
graphs can be constrained by limiting the input injection
rate to the graph. Adding a delay loop around the
source makes the source no longer unconditionally 
enabled (ref. 5). It is important to determine the appropri- 
ate lower bound on TBO for a given graph and number 
of resources. Determination of the lower bound on TBO 
is deferred to section 4. 

3. Dataflow Graph Analysis 

In the absence of initial tokens within the graph, a 
latest finish (LF) time analysis would be similar to the 
depth-first searching method used to calculate the earliest 
start times, only in the reverse direction. That is, search- 
ing backward from all sinks, the latest time each task 
associated with an encountered node must complete in 
order to prevent an increase in the TBIO given by the ES 
time analysis can be determined. The latest finish time 
for a given task is equal to TBIO (for a given sink) less 
the maximum path latency to the associated node output 
from all possible paths leading backwards from the sink. 
The combination of earliest start and latest finish times
provides the means to calculate the float or slack time that
might be present for each task. Slack time indicates the
maximum delay in task completion that can be tolerated
without delaying the start times of successor tasks, which
would result in an increase in TBIO. Slack time for a task
$T_i$ with latency $L_i$ is given by

$$\text{Slack time} = \text{LF}(T_i) - \text{ES}(T_i) - L_i \qquad (3)$$
When initial tokens are present within the graph, the
ES and LF analysis presented here must be modified 
slightly. The method for determining the steady-state 
behavior of a dataflow graph when initial tokens are 
present is based on a simple extension to the earliest start 
time analysis described in the previous section and a lat- 
est finish time analysis to be discussed here. It will be 
shown in later examples that initial tokens within the 
DFG not only affect the calculations of ES and LF times 
but may also be associated with recurrence loops (result- 
ing in graph circuits), which tend to complicate the graph 
search process. Modifications to the dataflow graph, 
which simplify the analysis, are defined here and can be 
shown to result in an equivalent model of the original 
graph. This modified dataflow graph is referred to hereafter
as the MDFG.

The MDFG can be constructed by letting all edges 
with one or more initial tokens undergo the transforma- 
tion shown in figure 6 where such edges are terminated 
with “virtual” sinks. Each virtual sink is labeled with the 
identifier of the node that consumes tokens from the orig- 
inal edge. In the cases where all input edges of a node 
have initial tokens, a virtual source for each such node is 
added so that the node is not left dangling without an 
input edge. The addition of these virtual sources main- 
tains compatibility with the ES Algorithm. The result- 
ing MDFG of the dataflow graph in figure 1 is shown in 
figure 7. 

The MDFG can now model the more complex prob- 
lem containing initial tokens but in a simpler, linear 
(source to sink) fashion. Now, the same ES analysis from 
all sources to sinks can be conducted as before. However, 
in order to ensure that the new MDFG is equivalent to 
the original dataflow graph, an additional time constraint 
must be imposed on the graph at these virtual sinks. 
Referring to figure 6, the time constraint is defined as 
follows: 

$$\text{LF}(T_i) = \text{ES}(T_t) + d\,(\text{TBO}) \qquad (4)$$

where LF($T_i$) represents the LF time of $T_i$ due to the
initial tokens, ES($T_t$) represents the ES time of $T_t$, and $d$ is
the number of initial tokens on the $T_i \prec T_t$ edge. Stated
in words, equation (4) determines the latest finish time of
task $T_i$, which returns a token on the edge initialized with
$d$ tokens, such that the firing of task $T_t$ will not be
delayed. The ES($T_t$) is determined by the ES Algorithm
starting from all MDFG sources. If equation (4)
results in a LF time less than the earliest finish (EF) time
of $T_i$, a time constraint has been violated. Since a task
cannot complete execution sooner than its earliest finish
time (as determined from the ES analysis), a transient
condition has been detected. For the first iteration, the
graph will execute according to the SGP$_{t-s}$ as defined by
the ES Algorithm. However, since the next data set
will arrive 1 TBO interval later, an additional time
constraint will be imposed if initial tokens exist in the graph.
The node $T_t$ with $d$ initial input tokens has the potential
(depending on other input dependencies) of repeated
firings until all $d$ tokens are consumed. With each node
firing with period TBO, the elapsed time to consume
$d$ tokens is the product of $d$ and TBO. The predecessor
node $T_i$ must return a token within $d$(TBO) time relative
to ES($T_t$) so that the next firing of $T_t$ is not delayed.
Therefore, in order for node $T_i$ to generate its first token
in this timely manner, which maintains the task schedule
defined by the first iteration SGP$_{t-s}$, it must do so by the
time determined by equation (4). Otherwise, the firing of
node $T_t$ will be delayed, resulting in SGP$_{s-s}$ $\neq$ SGP$_{t-s}$.

Figure 6. Constructing the modified dataflow graph.

Figure 7. The modified dataflow graph equivalent of figure 1.

Now that it has been shown that timing conflicts
determined by equation (4) indicate the presence of a
transient state, SGP$_{t-s}$ $\neq$ SGP$_{s-s}$, a method is needed to
translate the SGP$_{t-s}$ to the SGP$_{s-s}$. By adjusting the
earliest start times of the nodes affected by this delay, the
steady-state behavior when initial tokens are present can
be determined. When equation (4) indicates a timing
conflict, determine the time difference between the result
of equation (4), LF($T_i$), and the earliest finish of $T_i$,
EF($T_i$) = ES($T_i$) + $L_i$, and denote this difference by $\Delta$,

$$\Delta = \text{EF}(T_i) - \text{LF}(T_i) \qquad (5)$$

The method to translate the SGP$_{t-s}$ to the SGP$_{s-s}$ simply
involves adding $\Delta$ to the ES time of $T_t$. An ES time
analysis is then conducted again on the graph nodes
contained in the paths dependent on $T_t$. After completing this
ES time adjustment, an LF time analysis is required as
before for all paths backward from the sinks. This
process is repeated until no time conflicts are detected by
equation (5); that is, $\Delta \le 0$. The LF Algorithm stated
later in this section determines both the LF times and the
transient adjustments to the ES times and accounts for
initial token transients as described above.
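
Before the full algorithm is stated, the conflict test of equations (4) and (5) can be illustrated with a small Python sketch. The function name and all numeric values are hypothetical; the sketch evaluates a single initialized edge and does not propagate the resulting shift through the graph.

```python
def initial_token_conflict(es_terminal, es_initial, latency_initial, d, tbo):
    """Evaluate equations (4) and (5) for one edge T_i -> T_t with d tokens.

    Returns (lf_initial, delta): the latest finish of T_i from equation (4)
    and the difference of equation (5). A positive delta signals a timing
    conflict (transient state) and is the amount to add to the earliest
    start of T_t; delta <= 0 means no adjustment is needed.
    """
    lf_initial = es_terminal + d * tbo            # equation (4)
    ef_initial = es_initial + latency_initial     # EF(T_i) = ES(T_i) + L_i
    delta = ef_initial - lf_initial               # equation (5)
    return lf_initial, delta


# Hypothetical values: ES(T_t) = 100, ES(T_i) = 300, L_i = 200, one token.
lf_tight, delta_tight = initial_token_conflict(100, 300, 200, d=1, tbo=333)
lf_loose, delta_loose = initial_token_conflict(100, 300, 200, d=1, tbo=400)
```

In the first call the earliest finish of the predecessor (500) exceeds the latest finish allowed by equation (4) (433), so the earliest start of the terminal node would be delayed by 67 clock units. In the second call the longer iteration period removes the conflict.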

Given the linked-list graph representation shown in 
figure 3, a depth-first search algorithm that employs the 
same method used by the ES Algorithm (only in the 
reverse direction) will determine the latest finish times 
for all nodes (tasks). The algorithm calls the procedure 
SearchBkwd recursively for each input edge. As with 
the ES Algorithm, the recursive nature of this 
backward-search algorithm results in a depth-first search 
of a graph from sinks to sources while implicitly retain- 
ing the next edge (starting point for the next path to 
traverse when fanning out) and accumulated path latency 
on the memory stack. The arguments passed into
SearchBkwd are an address pointer (edge) to an edge
object in figure 3 and a latency value (path_latency).
This latency value is defined as the TBIO at
the starting sink less the sum of node latencies along the 
current path from the sink up to an encountered node. As 
in the SearchFwd procedure, let node specify a 
pointer to a node structure of figure 3. An edge will 
point to a next_input if present, and will be null if no 
other input edges for the current node exist. The itera- 
tive nature of the LF Algorithm for the cases where 
initial tokens are present within the DFG requires the 
inclusion of a boolean condition. The boolean condition 
Done in the LF Algorithm indicates when the process 
of determining LF times for all nodes is complete. The 
LF Algorithm is stated as follows: 

A. Initialize all LF times of tasks in $T$ to maximum
   storage value and set Done = False.

B. While not Done Loop through to Step K.

C. Set Done to True and repeat Step D for every
   sink in the graph.

D. If the sink is not virtual, set LF equal to the
   earliest start of the sink (already established
   by the ES Algorithm) and skip to Step J;
   else determine the terminal node, $T_t$, of the
   edge with the initial token and set LF equal to
   ES($T_t$) + $d$(TBO) where ES($T_t$) is the earliest
   start of $T_t$, $d$ is the number of initial tokens,
   and TBO is the iteration period.

E. Set $\Delta$ equal to earliest finish of $T_i$ minus LF.

F. If $\Delta$ is less than or equal to zero go to Step J;
   else set Done to False.

G. Increase the earliest start of $T_t$ by $\Delta$.

H. Call the procedure SearchFwd ($T_t$.output,
   ES($T_t$) + $L_t$) of the ES Algorithm in order
   to propagate the $\Delta$ time shift for all
   descendant nodes of $T_t$.

I. Increase LF by $\Delta$.

J. Call the procedure SearchBkwd (sink.input,
   LF).

K. Loop until Done.

SearchBkwd (edge, path_latency)

   1. If edge.next_input is not null, call
      SearchBkwd (edge.next_input,
      path_latency).

   2. Get the node that uses this edge for
      output by setting node equal to
      edge.initial_node.

   3. Determine the latest finish of node, LF
      (node), such that LF (node) = min
      [LF (node), path_latency].

   4. Decrease path_latency by the
      node latency, $L_{node}$.

   5. Set edge equal to the first input edge of
      node, edge = node.input.

   6. If a source has been reached (edge =
      null), return from this procedure;
      else repeat Step 1.
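
The backward search can be sketched in Python in the same style as the forward search. This is an illustrative sketch that covers only the case without initial tokens (the non-virtual branch of Step D followed by Step J); the iteration of Steps E through I is omitted. The classes, the helper connect, and the example graph with its precomputed ES times are hypothetical.

```python
import math


class Node:
    """Graph node (task); `input` is the first edge of its input list."""

    def __init__(self, name, latency, es):
        self.name = name
        self.latency = latency
        self.es = es          # earliest start, assumed already computed
        self.input = None     # first input edge, None for a source
        self.lf = math.inf    # Step A: latest finish starts at a maximum value


class Edge:
    """Graph edge; `next_input` links the other inputs of the same node."""

    def __init__(self, initial_node):
        self.initial_node = initial_node
        self.next_input = None


def connect(src, dst):
    """Hypothetical helper: prepend an edge src -> dst to dst's input list."""
    edge = Edge(src)
    edge.next_input = dst.input
    dst.input = edge


def search_bkwd(edge, path_latency):
    """Depth-first backward search following Steps 1 through 6."""
    while edge is not None:                       # Step 6: stop at a source
        if edge.next_input is not None:           # Step 1: fan out
            search_bkwd(edge.next_input, path_latency)
        node = edge.initial_node                  # Step 2
        node.lf = min(node.lf, path_latency)      # Step 3
        path_latency -= node.latency              # Step 4
        edge = node.input                         # Step 5


def slack(node):
    """Equation (3): slack = LF - ES - latency."""
    return node.lf - node.es - node.latency


# Hypothetical graph S -> A -> {B, C} -> D -> K with precomputed ES times.
nodes = {name: Node(name, latency, es) for name, latency, es in
         [("S", 0, 0), ("A", 100, 0), ("B", 200, 100),
          ("C", 50, 100), ("D", 100, 300), ("K", 0, 400)]}
for src, dst in [("S", "A"), ("A", "B"), ("A", "C"),
                 ("B", "D"), ("C", "D"), ("D", "K")]:
    connect(nodes[src], nodes[dst])

sink = nodes["K"]
search_bkwd(sink.input, sink.es)                  # Steps D and J, real sink
slacks = {name: slack(nodes[name]) for name in "ABCD"}
```

In this example tasks A, B, and D lie on the longest path and have zero slack, whereas task C can finish up to 150 clock units late without increasing TBIO.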

Since the method just presented to translate the
SGP$_{t-s}$ to the SGP$_{s-s}$ is recurrent, one may question if a
solution exists for all cases. This is important since, if a
solution does not exist, the method would hang in an
infinite loop. The answer is yes, there is a solution. The
proof lies in the fact that the only potential problem
results when circuits with initial tokens are present in the
dataflow graph. If adjustments were made to the ES
times of the nodes dependent on the edge initialized with
tokens that eventually led back to the original edge (due
to a circuit) with a new EF time, the new EF time would
again cause a conflict in equation (4), and the process
would repeat indefinitely, a run-away condition. Such a
condition implies that nodes firing on tokens propagating
through such a circuit could not produce a token on the
initialized edge in a timely manner. It has been shown
that the minimum graph-theoretic iteration period, $T_0$, is
given by the largest ratio of the $i$th circuit latency, $C_i$, to
the number of tokens in the circuit, $D_i$, for all circuits
within the DFG (refs. 3, 9, 11, and 14):

$$T_0 = \max\left(\frac{C_i}{D_i}\right) \quad \text{(for all $i$th circuits)} \qquad (6)$$

Equation (6) determines the minimum time in which
tokens can propagate through a circuit in one periodic
cycle and thus establishes a lower bound on TBO. The
only way this algorithm would fail to complete is if the
TBO of equation (4) is less than its lower bound $T_0$ given
by equation (6). Since TBO cannot be less than $T_0$, such
a timing conflict cannot occur and thus the ES/LF
algorithms previously presented will always have a solution.
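
Equation (6) can be evaluated directly once the circuits of the graph have been enumerated. The Python sketch below assumes the circuits are already known and are supplied as (latency, token count) pairs; circuit enumeration itself is not shown, and the function name and numeric values are hypothetical.

```python
from fractions import Fraction


def min_iteration_period(circuits):
    """Equation (6): largest ratio of circuit latency to circuit token count.

    `circuits` is an iterable of (latency, tokens) pairs, one per circuit.
    A circuit without tokens could never fire, so it is rejected here.
    """
    ratios = []
    for latency, tokens in circuits:
        if tokens <= 0:
            raise ValueError("every circuit needs at least one initial token")
        ratios.append(Fraction(latency, tokens))  # exact, no rounding error
    return max(ratios)


# Hypothetical circuits: 300 clock units with 1 token, 500 with 2 tokens.
t0 = min_iteration_period([(300, 1), (500, 2)])
tbo_is_feasible = 333 >= t0
```

Exact fractions are used so that the comparison of TBO against the lower bound is not affected by floating-point rounding.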

As an alternative approach, the steady-state ES times
could be determined during the forward search of the
graph by applying equation (4) (solving for ES($T_t$) with
LF($T_i$) set equal to the path latency) whenever
encountering forward-path initial tokens. After determining all
steady-state ES times, the LF times could then be
calculated without requiring any further adjustments to the ES
times, resulting in a one-time pass of the graph in the
forward and backward direction. The algorithms are
presented in the potentially recurrent form for the purpose of
efficiently handling the frequent cases. That is,
application of equation (4) (solved for ES($T_t$)) would be
required each time an edge with initial tokens was
encountered by traversing multiple paths that may
converge on the edge. Use of equation (4) once when
beginning with a virtual sink would tend to minimize its use.
Also, it is felt that the frequent cases involve
uninitialized edges or initialization of recurrence loops (no
forward-path tokens). Thus, this only requires the
one-time use of equation (4) by the LF Algorithm for
the purpose of calculating slack time within the
recurrence loop. Like the ES Algorithm, the time
complexity of the LF Algorithm is bounded by equation (1).
Thus, the LF Algorithm can also be executed in
polynomial time with a worst-case bound of $O(N^3)$.

The result of applying the LF Algorithm to the DFG of 
figure 1 for a TBO of 333 clock units is shown in 
figure 8. As expected, the slack time of task C extends all 
the way to the start time of task F. This would also be the 
case for task E if it were not for the initial token on the 
E ≺ D edge. Because of this token, the slack time of 
task E extends out only 33.3 clock units for the current 
iteration period of 333 clock units. The fact that this 
slack is associated with the next iteration of task D is 
apparent from the TGP diagram of figure 5, where the 
time between the completion of task E and the start of 
task D is equal to 33.3 clock units. 

Figure 8. Single graph play diagram showing slack time, ω = 
600 clock units. 

4. Performance Metrics and Resource 
Requirements 

The two types of concurrency that can be exploited 
in dataflow algorithms can be classified as parallel and 
pipeline. The TBO and TBIO performance metrics 
defined in the previous sections are important in evaluat- 
ing the efficiency of the algorithm execution, that is, how 
well the inherent parallelism within the algorithm is 
being exploited. Therefore, it is important to determine 
the bounds on these metrics which define the optimum 
scheduling solution. 

4.1. Critical Path Analysis 

Parallel concurrency is associated with the execution 
of tasks that are independent (no precedence relationship 
imposed by ≺). The extent to which parallel concurrency 
can be exploited is dependent on the number of 
parallel paths within the DFG and the number of 
resources available to exploit the parallelism. The TBIO 
metric in relation to the time it would take to execute all 
tasks sequentially can be a good measure of the parallel 
concurrency inherent within a DFG. If there are no initial 
tokens present in the DFG, TBIO can be determined with 
the traditional critical path analysis, where TBIO is given 
as the sum of latencies in L along the critical path. When 
M_0 defines initial tokens in the forward direction, the 
graph takes on a different behavior as represented by the 
new paths within the MDFG. Cases such as this include 
many signal processing and control algorithms where ini- 
tial tokens are expected to provide previous state infor- 
mation (history) or to provide delays within the 
algorithm. For the example shown in figure 9, the task 
output z(n) associated with the nth iteration is dependent 
on the current input x(n), input y(n - d_1) provided by 
the (n - d_1)th iteration, and input w(n - d_2) produced 
by the (n - d_2)th iteration. 

z(n) = x(n) * y(n - d_1) * w(n - d_2) 

Figure 9. Example function implementation. 

Implementation of this function would require d_1 initial 
tokens on the y(n - d_1) edge and d_2 initial tokens 
on the w(n - d_2) edge in order to create the desired 
delays. In such cases, the critical path and thus TBIO are 
also dependent on the iteration period TBO. For example, 
given that a node fires when all input tokens are 
available, assuming sufficient resources, the earliest time 
at which the node shown in figure 9 could fire would be 
dependent on the longest path latency leading to either 
the x(n) or y(n - d_1) edge. Assuming that the d_1 and 
d_2 tokens are the only initial tokens within the graph, the 
time it would take a token associated with the nth iteration 
to reach the x(n) edge would equal the path latency 
leading to the x(n) edge. Likewise, the minimum time 
at which the “token” firing the nth iteration on the 
y(n - d_1) edge could arrive from the source equals the 
path latency leading to the y(n - d_1) edge. However, 
since this “token” is associated with the (n - d_1)th 
iteration (produced d_1 TBO intervals earlier), the actual 
path latency referenced to the same iteration is reduced 
by the product of d_1 and TBO. From this example, it is 
easy to infer that the actual path latency along any path 
with a collection of d initial tokens is equal to the 
summation of the associated node latencies less the product 
of d and TBO. Thus, the critical path (and TBIO) is a 
function of TBO and is given as the path from source to 
sink that maximizes the following equation for TBIO: 

$$\text{TBIO} = \max \left( \sum_{L_i \in \text{path}} L_i - d \, (\text{TBO}) \right) \quad \text{(for all paths)} \qquad (7)$$

where d is the total number of initial tokens along the 
path. It is easy to see that the critical path for the DFG in 
figure 1 is A ≺ B ≺ F, resulting in a TBIO of 
600 clock units. 

4.2. Calculated Speedup 

Pipeline concurrency is associated with the repetitive 
execution of the algorithm for successive iterations 
without waiting for earlier iterations to complete. 
Equation (6) defines the lower bound iteration period T_0 
due to the characteristics of the graph alone. That is, if 
circuits are present in the DFG, T_0 is given by 
equation (6), otherwise T_0 is zero. Given a finite number 
of processors, however, the actual lower bound on 
iteration period (or TBO_lb) is given by 

$$\text{TBO}_{lb} = \max \left( T_0, \frac{\text{TCE}}{R} \right) \qquad (8)$$

where TCE (total computing effort) is the sum of 
latencies in L, 

$$\text{TCE} = \sum_{L_i \in L} L_i \qquad (9)$$

and R is the number of available processors. The 
theoretically optimum value of R for a given TBO period, 
referred to as the calculated R, is given as 

$$R = \left\lceil \frac{\text{TCE}}{\text{TBO}} \right\rceil \qquad (10)$$

Since every task executes once within an iteration period 
of TBO with R processors and takes TCE amount of time 
with one processor, speedup S using Amdahl’s Law can 
be defined as 

$$S = \frac{\text{TCE}}{\text{TBO}} \qquad (11)$$

and processor utilization U ranging from 0 to 1 can be 
defined as 

$$U = \frac{S}{R} \qquad (12)$$
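
As a worked illustration of equations (8) through (12), the following Python sketch evaluates the bounds for a total computing effort of 1000 clock units. The function names are illustrative, the ceiling in `calculated_processors` is an assumption (a processor count must be an integer), and the value T_0 = 300 is an assumed circuit bound rather than a figure stated in this section. Exact fractions are used so that 1000/3 is not rounded to 333.

```python
import math
from fractions import Fraction

def tbo_lower_bound(t0, tce, r):
    """Equation (8): the larger of the circuit bound and TCE / R."""
    return max(Fraction(t0), Fraction(tce, r))

def calculated_processors(tce, tbo):
    """Equation (10), assuming the ratio is rounded up to an integer."""
    return math.ceil(Fraction(tce) / Fraction(tbo))

def speedup(tce, tbo):
    """Equation (11): S = TCE / TBO."""
    return Fraction(tce) / Fraction(tbo)

def utilization(tce, tbo, r):
    """Equation (12): U = S / R."""
    return speedup(tce, tbo) / r

TCE = 1000   # sum of the task latencies used in the worked example
T0 = 300     # assumed circuit bound (hypothetical value for illustration)

for r in (1, 2, 3, 4):
    tbo = tbo_lower_bound(T0, TCE, r)
    print(r, float(tbo), float(speedup(TCE, tbo)), float(utilization(TCE, tbo, r)))
```

With three processors the sketch gives TBO_lb = 1000/3 (about 333), a speedup of 3, and a utilization of 1. With four processors the assumed circuit bound dominates, so the speedup levels off below 4.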

4.3. Run-Time Memory Requirements 

The scheduling techniques offered by this paper are 
intended to apply to the periodic execution of algorithms. 
In many instances, the algorithms may execute indefi- 
nitely on an unlimited stream of input data, for example, 
digital signal processing algorithms. Even though the 
multiprocessor schedules determined by the ES Algo- 
rithm and LF Algorithm are periodic, it is important 
to determine if the memory requirements for the data are 
bounded. Just knowing that the memory requirements are 
bounded may not be enough. One may also wish to cal- 
culate the maximum memory requirements a priori. By 
knowing the upper bound on memory, the memory can 
be allocated statically at compile time to avoid the 
run-time overhead of dynamic memory management. 
Since the dataflow graph edges imply physical storage of 
the data shared among tasks, graph-theoretic rules are 
defined in this section capable of determining the bound 
on memory required for the shared data. 

For the following discussion, it is helpful to present a 
slightly more detailed model of the parallel computation of 
tasks represented by a DFG. The Petri net model shown in 
figure 10 describes the activities associated with the 
execution of ordered dataflow tasks, T_i ≺ T_j. The Petri net 
shown in figure 10 belongs to a special class of 
Petri nets called marked graphs (ref. 15). This model is 
equivalent to the ATAMM computational marked graph 
(CMG) shown in references 13, 14, and 16. As shown in 
figure 10, the edges directed from left to right represent 
dataflow while the edges from right to left represent 
control flow. Of particular interest, the edges associated with 
the output empty (OE) place can be regarded as an 
“acknowledgment edge.” That is, given the data 
dependency T_i ≺ T_j, the acknowledgment edge provides a 
signal to node T_i indicating that node T_j has consumed a 
token from the output full (OF) place. The number of 
tokens present at any one time in the OE place represents 
the total number of empty data buffers available for 
output data tokens. The number of buffers currently 
occupied with data tokens is represented by the number of 
tokens in the OF place. Pairing every data edge with an 
acknowledgment edge assures that a buffer will be 
available for the output data before a task begins execution. A 
modeled task is enabled for execution when all necessary 
input tokens to the Fire transition are available. After 
firing, the node will produce a token in the busy place, 
enabling the Data transition. The Data transition for 
node T_i of T will generate a token at the output places 
after delaying an amount of time equal to L_i of L. The 
idle place between the Data and Fire transitions is 
included to convey information about task instantiations 
at run time. The graph shown in figure 10(b) has been 
shown to be consistent (refs. 11 and 15). This implies 
that given an initial marking, the total number of tokens 
within a circuit remains unchanged for all valid markings 
reached by firing transitions. Therefore, the initial 
number of tokens located in the idle place will ultimately 
migrate to the busy place; this indicates the number of 
task instantiations at run time. Based on equation (6), the 
number of tokens that must be present in a circuit for a 
given iteration period, TBO, is given by the following 
equation 


$$D_i = \left\lceil \frac{C_i}{\text{TBO}} \right\rceil \quad \text{(for all circuits } i\text{)} \qquad (13)$$

and thus the circuit formed by the idle place between 
the Data and Fire transitions implies that the required 
number of instantiations of task T_i, which was derived from 
the TGP diagram, is determined by the following 
equation: 

$$\text{Instantiations of } T_i = \left\lceil \frac{L_i}{\text{TBO}} \right\rceil \qquad (14)$$

(a) DFG model of T_i ≺ T_j. 

Figure 10. Petri net representation of dataflow graph. 


Because DFG tokens carry data values (or pointers 
to where the data are located when the tokens become 
heavy), the DFG edges, which transport tokens from one 
node to the next, imply physical memory space. Again 
relying on the token conservation property, the summa- 
tion of the initial OF tokens due to initial data and the ini- 
tial number of OE tokens needed to satisfy equation (13) 
determines the maximum buffer space required for the 
data associated with the DFG edge at run time — ideally, 
ignoring fault tolerant issues. The initial tokens required 
in the OE and OF places can also be determined from the 
TGP diagram, but in a less obvious way. 
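
Equations (13) and (14) amount to rounding a latency up to a whole number of TBO periods. The sketch below assumes that ceiling interpretation and uses integer clock units; the helper names and the latency table (taken from the worked ES/LF example of this section) are illustrative.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers, without floating point."""
    return -(-a // b)

def circuit_tokens(circuit_latency, tbo):
    """Equation (13): tokens a circuit needs to sustain the period TBO."""
    return ceil_div(circuit_latency, tbo)

def instantiations(latencies, tbo):
    """Equation (14): simultaneous instances of each task at period TBO."""
    return {task: ceil_div(lat, tbo) for task, lat in latencies.items()}

# Latencies inferred from the worked example (clock units).
LATENCY = {"A": 100, "B": 400, "C": 100, "D": 200, "E": 100, "F": 100}

print(instantiations(LATENCY, 333))
print(circuit_tokens(300, 333))
```

For a TBO of 333 clock units only task B, whose latency of 400 exceeds the period, needs two instantiations; a 300-clock-unit circuit needs a single token.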

Initial OE tokens can be determined by examining 
the relative firing times of the predecessor and successor 
tasks along with the corresponding data set displace- 
ments. The OE Rule can be used to determine the initial 
number of OE tokens indicating the data buffers that are 
initially empty and is as follows: 

Let S(T_i) represent the start time of task T_i relative 
to a TBO interval as portrayed in the TGP 
diagram, and let D_s(T_i) represent the relative 
data set number associated with the start time of 
task T_i. The start time S(T_i) can be calculated 
directly from the ES of T_i with the equation 

$$S(T_i) = \text{ES}(T_i) \bmod \text{TBO} \qquad (15)$$

The relative data set number can also be determined 
from the TGP diagram or calculated 
directly by the equation 

$$D_s(T_i) = \Phi - \left\lfloor \frac{\text{ES}(T_i)}{\text{TBO}} \right\rfloor \qquad (16)$$

where the floor function is applied to the ratio of 
ES(T_i) and TBO, and Φ is given by equation (2). 
Then, given a task T_p, let T_s represent the successor 
task which uses the output data of T_p as 
input and OE_ps be the initial OE tokens required 
for the precedence relation T_p ≺ T_s. 

If D_s(T_p) - D_s(T_s) ≥ 0 
   Then If S(T_p) ≤ S(T_s) 
           Then OE_ps = D_s(T_p) - D_s(T_s) + 1 
           Else OE_ps = D_s(T_p) - D_s(T_s) 
   Else OE_ps = 0 


In terms of the graph nodes, a negative 
D_s(T_p) - D_s(T_s) indicates that the successor node has 
fired more often than the predecessor node it is 
dependent on. The only way this could be possible is if there 
were initial tokens present in the OF place. A positive 
difference D_s(T_p) - D_s(T_s) represents the number of 
times the predecessor node fires before the successor 
node fires once. This difference would therefore be the 
initial tokens required in the OE place. If 
S(T_p) > S(T_s), then the successor node would have 
returned the one token required in the OE place for the 
predecessor to fire again, and thus no additional tokens 
are needed. However, the condition S(T_p) ≤ S(T_s) 
indicates that the predecessor node must fire before or at the 
same time the successor node fires and returns the OE 
token. Therefore, the S(T_p) ≤ S(T_s) condition requires 
that one extra token be included initially in the OE place. 

For example, the OE Rule utilizing the TGP of 
figure 5 for the C ≺ F edge specifies that OE_CF = 2; in 
other words, two empty data buffers are initially 
required. Since the data edge did not have any initial 
tokens (no initially full buffers), two buffer spaces would 
be required at run time. 
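
The OE Rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the Design Tool's code: the relative data set number is taken to drop by one for each full TBO interval that elapses before a task starts, so the difference D_s(T_p) - D_s(T_s) reduces to a difference of floor terms, and the comparisons are taken as "greater than or equal to zero" and "less than or equal", which is the conservative reading described in the text. The function name and the ES values (ES(C) = 100 and ES(F) = 500 for the unmodified graph) are inferred for illustration.

```python
def oe_tokens(es_p, es_s, tbo):
    """Initial OE tokens for the edge T_p -> T_s (sketch of the OE Rule).

    es_p, es_s: earliest start times of predecessor and successor
    tbo:        iteration period, in the same clock units
    """
    s_p, s_s = es_p % tbo, es_s % tbo        # equation (15)
    # D_s(T_p) - D_s(T_s): the common offset cancels in the difference.
    diff = es_s // tbo - es_p // tbo
    if diff >= 0:
        return diff + 1 if s_p <= s_s else diff
    return 0

TBO = 333
print(oe_tokens(100, 500, TBO))   # C -> F edge: 2 empty buffers
print(oe_tokens(300, 100, TBO))   # E -> D edge: 0 empty buffers
print(oe_tokens(0, 0, TBO))       # self-loop: 1 empty buffer
```

The first call reproduces OE_CF = 2 from the example above, and the last call reproduces the self-recurrence case, where one empty buffer is needed in addition to the initial data.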

There is one item that must be mentioned concerning 
the OE Rule. For all practical purposes the ≤ in the 
S(T_p) ≤ S(T_s) expression can be replaced with a <. 
This change has the effect of delaying the firing of the 
predecessor node by one Fire transition time when T_p 
and T_s would otherwise start simultaneously. If the Fire 
transition time which may represent the reading of input 
data is considered negligible in the case of large-grained 
algorithms, being conservative with tokens (and thus 
buffer space) is easily tolerated. The rule represents the 
more conservative case in order to satisfy the general 
problem. One special case is shown in figure 11 as a 
node with a self-recurrence circuit (representing the fact 
that the task represented by the node has history). The OE 
Rule would indicate that one initially empty buffer is 
needed in addition to the initial data occupying a second 
buffer. Use of the conservative token approach would not 
make sense in this case because a node that is 
self-dependent cannot wait on itself to fire. 

The OE Rule determines the number of data buffers 
needed in addition to the buffers required for initial data 
for all edges within the DFG. Therefore, the resource 
requirement in terms of total buffer space for a given 
data edge is equal to the OE tokens given by the OE 
Rule plus the number of initial tokens present on the 
edge. Calculating resource requirements in terms of 
processors is more straightforward. The minimum processor 
requirement R for a given TBO at steady state can be 
derived simply by counting the maximum overlap of bars 
within the corresponding TGP. However, the R 
determined may not be optimum for a given precedence 
relation ≺. For example, given only three processors, 
TBO_lb for the DFG of figure 1 by equation (8) is equal 
to 333, which by equations (11) and (12) would indicate 
that three processors would provide maximum linear 
speedup with 100 percent processor utilization. Even though 
the processor requirement for a single graph iteration is 
three (determined by counting the maximum overlap of bars 
in fig. 8), repetitive execution with a period of 333 
requires four processors, as can be derived from figure 5. 
This is because the precedence constraints imposed by ≺ 
make finding this optimal solution NP-complete, and the 
design process presented in this paper only provides the 
determination of a sufficient number of processors in order 
to guarantee a schedule meeting TBO and TBIO requirements 
(refs. 9 and 10). In fact, one cannot guarantee that a 
multiprocessor-scheduling solution even exists when all 
three parameters (TBO, TBIO, and R) are fixed (ref. 9). 
Accordingly, it is necessary to find another schedule, if 
one exists, that would provide the desired computational 
speedup performance; a method for doing so is discussed 
in the next section. 

(b) Petri net model. 

Figure 11. Petri model of self-loop circuit. 



(b) SGP diagram (horizontal axis: time, clock units). ω = 600 clock units. 

Figure 12. Diagrams with E ≺ C control edge. 

4.4. Control Edges 

Imposing additional precedence constraints or artifi- 
cial data dependencies onto T (thereby changing the 
schedule) is a viable way to improve performance (refs. 5 
and 17). These artificial data dependencies are referred to 
as “control edges.” As an illustration, observe that there 
is needless parallelism being exploited for the single 
graph execution shown in figure 8; that is, three proces- 
sors are not necessary to exploit all of the parallel 
concurrency — two would suffice. This presents an 
opportunity to take advantage of the slack time present in 
the graph to reduce the processor requirement without 
affecting the critical path. 

Since task C does not need to complete execution 
until 500 clock units as shown in figure 8, a control edge 
can be included in order to create the precedence 
relationship E ≺ C, effectively delaying task C until the 
completion of task E as shown in figure 12. The subsequent 
TGP with the added control edge is shown in 
figure 13 with the resulting resource envelope showing 
the processor utilization over the given TBO period. As 
can be seen from figure 13, it is only necessary to effec- 
tively move the amount of effort requiring four proces- 
sors in such a way as to fill the idle time shown in the 
resource envelope. It turns out in this example that this 
can be done by delaying task D behind task B (a delay of 
67 clock units) in relation to the TGP description of 
steady-state behavior. The new TGP diagram can be 
derived from the original by shifting all successor tasks 
of task D accordingly. The TGP diagram with the added 
B ≺ D precedence relationship shown in figures 14(a) 
and (b) results in 100 percent processor utilization. The 
new steady-state SGP shown in figure 14(c) can be con- 
structed by shifting tasks D, E, C, and F to the right by 
67 clock units, as was done to obtain the new TGP 
diagram. 
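
The processor counts quoted in this discussion can be checked by folding each task's busy interval into one TBO window and counting the peak overlap, which is what the resource envelope displays. The Python sketch below is an illustration only: the function name, the start times of the unmodified graph, and the exact values TBO = 1000/3 and shift = 200/3 (rounded in the text to 333 and 67) are assumptions made so that the arithmetic is exact.

```python
from fractions import Fraction

def processors_required(es, latency, tbo):
    """Peak number of busy task instances within one steady-state TBO window."""
    events = []
    for task, start in es.items():
        begin, length = Fraction(start) % tbo, Fraction(latency[task])
        # Split the busy interval into pieces that fit inside [0, tbo).
        while length > 0:
            piece = min(length, tbo - begin)
            events.append((begin, +1))
            events.append((begin + piece, -1))
            length -= piece
            begin = Fraction(0)
    # At equal times, process releases (-1) before acquisitions (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    busy = peak = 0
    for _, delta in events:
        busy += delta
        peak = max(peak, busy)
    return peak

LATENCY = {"A": 100, "B": 400, "C": 100, "D": 200, "E": 100, "F": 100}
TBO = Fraction(1000, 3)
SHIFT = Fraction(200, 3)   # the 67-clock-unit delay, kept exact

ES_ORIGINAL = {"A": 0, "B": 100, "C": 100, "D": 100, "E": 300, "F": 500}
ES_EC = dict(ES_ORIGINAL, C=400)                      # E -> C control edge only
ES_BOTH = {"A": 0, "B": 100, "D": 100 + SHIFT, "E": 300 + SHIFT,
           "C": 400 + SHIFT, "F": 500 + SHIFT}        # E -> C and B -> D

print(processors_required(ES_ORIGINAL, LATENCY, TBO))  # 4
print(processors_required(ES_EC, LATENCY, TBO))        # 4
print(processors_required(ES_BOTH, LATENCY, TBO))      # 3
```

Under these assumptions the unmodified schedule and the schedule with only the E to C control edge both peak at four processors, while shifting tasks D, E, C, and F fills the idle time and brings the peak down to three.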



(a) TGP diagram (time axis from t to t + TBO). TBO = 333 clock units. 

(b) Resource envelope. 

Figure 13. Periodic behavior with E ≺ C control edge. 

Referring to the new SGP diagram in figure 14(c), it 
is apparent that this scheduling solution for optimum 
throughput and processor utilization has been achieved at 
the cost of increasing TBIO. Inserting the B ≺ D 
precedence relationship to delay the start of task D behind the 
start of task B by 67 clock units, which results in a TBIO of 
667 clock units, is an interesting concept. Since we know 
that three processors are sufficient for tasks B and D to 
start at the same time for the first iteration, the B ≺ D 
precedence relationship has caused a transient condition. 
The reason for this transient becomes apparent by 
examining the TGP schedule of figures 14(a) and (b). The 
TGP schedule indicates that the nth token (relative data 
set number 2) consumed by node D is the (n - 1)th 
token (relative data set number 1) produced by the 
predecessor node B; this implies that one initial token is 
required on the B ≺ D control edge, as shown in 
figure 14(d), to create the single-TBO delay required to 
achieve the steady-state schedule shown in figures 14(a) 
and (b). Without the single-TBO synchronization delay 
due to the initial token, the path 
A ≺ B ≺ D ≺ E ≺ C ≺ F would result in a TBIO 
equal to the graph TCE of 1000 rather than 667 clock 
units (eq. (7)). This is interesting in that the transients 
caused by initial data token delays, which tend to 
complicate the analysis, become a useful trait for control edges. 
Without initial tokens, control edges have only 
intra-iteration precedence relationships between two 
tasks and consequently provide only limited rescheduling 
options. The rescheduling options are those shown by the 
SGP diagram between independent tasks. Control edges 
properly initialized with tokens result in inter-iteration 
relationships between tasks that provide additional 
rescheduling options. Such control edges allow one to 
choose rescheduling options from the TGP diagram, 
which can provide more opportunities to find tasks to 
delay behind other tasks. 

Up to now, a general rule for calculating OF tokens 
was not needed because the initial data tokens are given 
by the algorithm description as portrayed in figure 9. 
However, with the use of control edges it is necessary to 
calculate the required number of OF tokens. A question 
that may have been raised about the OE Rule is 
what happens if D_s(T_p) - D_s(T_s) is a negative number; 
this would mean that the tokens bound to this edge circuit 
are initially located in the OF place. Just as in any linear 
algebra problem with two unknowns, two rules (equations) 
are required in order to solve for the total number 
of tokens (OE and OF) needed within a given edge 
circuit. This second rule is referred to as the “OF Rule” 
and determines the number of tokens, if any, initially 
required on the forward (OF) edge. The OF Rule is 
stated as follows: 

Let S(T_i) and F(T_i) represent the start time 
and finish time of task T_i, respectively, and 
let D_s(T_i) represent the relative data set number 
associated with the start of task T_i; S(T_i), 
F(T_i), and D_s(T_i) are relative to a TBO 
interval as portrayed in the TGP diagram. As for 
the OE Rule, these values can be obtained from 
the TGP diagram or from equations (15) 
and (16) with the addition of 

$$F(T_i) = (\text{ES}(T_i) + L_i) \bmod \text{TBO} \qquad (17)$$

Because the data set number associated with the 
start of execution will be greater than the data 
set number associated with the completion of a 
multiply-instantiated task, let D_f(T_i) represent 
the relative data set number associated with the 
finish time of task T_i, which can be calculated 
with 

$$D_f(T_i) = \Phi - \left\lfloor \frac{\text{ES}(T_i) + L_i}{\text{TBO}} \right\rfloor \qquad (18)$$

Then, given a task T_p, let T_s represent the successor 
task which uses the output data of T_p as 
input and OF_ps be the initial OF tokens required 
for the precedence relation T_p ≺ T_s. 

If D_s(T_s) - D_f(T_p) > 0 
   Then If S(T_s) < F(T_p) 
           Then OF_ps = D_s(T_s) - D_f(T_p) + 1 
           Else OF_ps = D_s(T_s) - D_f(T_p) 
   Else OF_ps = 0 

(a) TGP diagram. TBO = 333 clock units. 

(b) Resource envelope. TBO = 333 clock units. 

(c) SGP diagram. TBIO = 667 clock units; ω = 667 clock units. 

(d) Modified DFG diagram. 

Figure 14. Periodic behavior with E ≺ C and B ≺ D control edges. 

In terms of the graph nodes, a negative 
D_s(T_s) - D_f(T_p) indicates that the predecessor node 
has fired more often than its successor node, which is the 
frequent case. These tokens are accounted for in the OE 
Rule. A difference D_s(T_s) - D_f(T_p) > 0 represents 
the number of times the successor node fires before the 
predecessor node completes just once. The only way this 
could occur is if there were initial tokens in the OF place. 
This difference would therefore be the number of initial 
tokens required in the OF place. If S(T_s) ≥ F(T_p), then 
the predecessor node would have deposited the one token 
required in the OF place for the successor node to fire 
again, and thus no additional tokens are needed. However, 
the condition S(T_s) < F(T_p) indicates that the 
successor node must fire before the predecessor node 
deposits an OF token. Therefore, the S(T_s) < F(T_p) 
condition requires that one extra token be included 
initially in the OF place. 

Applying the conditions shown in figures 14(a) 
and (b), the OF Rule indicates that one initial token is 
required on the B ≺ D control edge, as expected from 
this discussion. Also, the OF Rule is general enough 
that it not only computes the initial tokens (if any) 
required on inter-iteration control edges but also agrees 
with initial token conditions on data edges in most cases. 
In some cases, initial data tokens may only serve the 
purpose for which they were intended, that is, to create delay 
conditions for computations as portrayed in figure 9. 
When initial data tokens also affect the steady-state 
schedule, the OF Rule applied to such data edges would 
agree with the initial conditions. Just such a case 
involves the E ≺ D edge in the example graph. As one would 
expect, the OF Rule utilizing the TGP of figure 5 for the 
E ≺ D edge results in OF_ED = 1, which indicates that 
one initial token is present. Likewise, the OE Rule 
specifies that OE_ED = 0, which indicates that an initially 
empty buffer is not necessary at run time; the total 
buffer space for edge E ≺ D is therefore 1. However, 
just as the primary purpose of the OE Rule is to compute 
the number of data buffers required in addition to the ini- 
tial data buffers, the primary purpose of the OF Rule is 
to compute initial tokens for inter-iteration control edges. 
The OF Rule applied to data edges will only convey 
information that the user already knows. Likewise, since 
by definition, control edges do not require data buffers, 
the OE Rule does not serve a purpose for control edges 
unless for some reason the user wanted to implement a 
graph management operating system that treated data 
edges and control edges the same, except for the attach- 
ment of physical buffers. 
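
A companion sketch for the OF Rule follows. As with any such illustration, it rests on assumptions: the relative data set number is taken to drop by one for each full TBO interval, so that D_s(T_s) - D_f(T_p) reduces to a difference of floor terms, and the function name and argument layout are hypothetical. The start times are the rounded values quoted in the text (TBO = 333, adjusted ES(D) = 167).

```python
def of_tokens(es_p, lat_p, es_s, tbo):
    """Initial OF tokens for the edge T_p -> T_s (sketch of the OF Rule).

    es_p, lat_p: earliest start and latency of the predecessor
    es_s:        earliest start of the successor
    tbo:         iteration period, in the same clock units
    """
    f_p = (es_p + lat_p) % tbo                 # equation (17)
    s_s = es_s % tbo                           # equation (15)
    # D_s(T_s) - D_f(T_p): the common offset cancels in the difference.
    diff = (es_p + lat_p) // tbo - es_s // tbo
    if diff > 0:
        return diff + 1 if s_s < f_p else diff
    return 0

TBO = 333
print(of_tokens(100, 400, 167, TBO))   # B -> D control edge: 1 token
print(of_tokens(300, 100, 100, TBO))   # E -> D data edge: 1 token
print(of_tokens(0, 100, 100, TBO))     # A -> B data edge: 0 tokens
```

The first two calls agree with OF_BD = 1 and OF_ED = 1 as discussed above; an ordinary edge such as A to B, where the successor starts when the predecessor finishes within the same data set, needs no initial token.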

One last example would be appropriate before 
presenting the Design Tool which implements the 
algorithms and rules discussed in this and previous sections. 
It has been shown that the addition of the E ≺ C and 
B ≺ D control edges for a TBO of 333 clock units 
results in linear speedup with three processors and a 
TBIO equal to 667 clock units. Since this particular 
solution includes an initial token in the forward direction of 
the B ≺ D edge, analyzing this graph with the ES and 
LF Algorithms should confirm the correctness of the 
solution. The modified dataflow graph of figure 14(d) 
with the additional control edges is shown in figure 15. 
Utilization of the ES Algorithm results in earliest start 
times of ES(A) = 0, ES(B) = ES(A) + L(A) = 100, 
ES(D) = ES(A) + L(A) = 100, ES(E) = ES(D) 
+ L(D) = 300, ES(C) = ES(E) + L(E) = 400, and 
ES(F) = ES(C) + L(C) = ES(B) + L(B) = 500. 



Figure 15. Equivalent MDFG model of figure 14(d). 


The first application of the backward search by the 
LF Algorithm beginning at the real sink results in 
latest finish times of LF(F) = ES(sink) = EF(F) = 600, 
LF(B) = LF(F) - L(F) = 500, LF(A) = LF(B) 
- L(B) = 100, LF(C) = LF(B) = 500, LF(E) = min[LF(C) 
- L(C), LF(F) - L(F)] = 400, LF(D) = LF(E) 
- L(E) = 300, and LF(A) = min[LF(B) - L(B), LF(C) 
- L(C), LF(D) - L(D)] = 100. 

Next, applying the LF Algorithm beginning at the 
virtual sink (D') corresponding to the E ≺ D data edge 
gives an LF time for node E of ES(D) + (1)(TBO) = 100 
+ (1)(333) = 433 clock units, which is greater than its 
earliest finish of 400 clock units. Progressing backwards 
does not change the latest finish times of nodes A and D. 
Finally, applying the LF Algorithm beginning at the 
virtual sink (D') corresponding to the B ≺ D control 
edge gives an LF time for node B of ES(D) + (1)(TBO) = 
100 + (1)(333) = 433 clock units. However, since the 
previous ES analysis indicates that node B cannot 
complete until 500 clock units, a transient condition has been 
found with a Δ (eq. (5)) equal to 67 clock units. 
Therefore, node D initially starts execution as soon as node A 
completes during the transient state, but at steady state, 
node D will be delayed after the completion of node A by 
67 clock units. Adding Δ = 67 clock units to the ES times 
along the path D ≺ E ≺ C ≺ F results in adjusted 
earliest start times of ES(D)' = 167, ES(E)' = 367, 
ES(C)' = 467, and ES(F)' = 567. 
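
The numbers in this walk-through can be reproduced with a short Python sketch. It is not the recurrent ES/LF procedure of the paper: it computes longest-path start times over the token-free edges, takes Δ as the amount by which the earliest finish of B exceeds the latest finish allowed by the B to D control edge (an assumed reading of equation (5), which is not reproduced in this section), and then recomputes the start times with node D held back by Δ. All names and the graph encoding are illustrative.

```python
def earliest_starts(latency, edges, floor=None):
    """Longest-path earliest start times over an acyclic edge list.

    floor: optional dict of node -> minimum start time to enforce.
    """
    floor = floor or {}
    preds = {n: [] for n in latency}
    for u, v in edges:
        preds[v].append(u)
    es = {}

    def visit(n):
        if n not in es:
            ready = max((visit(p) + latency[p] for p in preds[n]), default=0)
            es[n] = max(ready, floor.get(n, 0))
        return es[n]

    for n in latency:
        visit(n)
    return es

LATENCY = {"A": 100, "B": 400, "C": 100, "D": 200, "E": 100, "F": 100}
# Edges without initial tokens, including the E -> C control edge.
FORWARD = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "F"),
           ("C", "F"), ("D", "E"), ("E", "F"), ("E", "C")]
TBO = 333

es = earliest_starts(LATENCY, FORWARD)
# B -> D control edge with one initial token.
lf_b = es["D"] + 1 * TBO                  # latest finish allowed for B: 433
ef_b = es["B"] + LATENCY["B"]             # earliest finish of B: 500
delta = max(0, ef_b - lf_b)               # transient: 67
es_ss = earliest_starts(LATENCY, FORWARD, floor={"D": es["D"] + delta})
tbio = es_ss["F"] + LATENCY["F"]
print(es, delta, es_ss, tbio)
```

The sketch returns the same adjusted start times as the text (167, 367, 467, and 567 for D, E, C, and F) and a TBIO of 667 clock units.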

Table I. Summary of DFG Attributes for TBO = 333 clock units, TBIO = 667 clock units, and R = 3 

Applying the LF Algorithm again at the virtual 
sink D' gives LF(B) = ES(D)' + (1)(333) = 500, 
equal to the earliest finish time of node B, as expected. 
After calculating the latest finish times once more, the 
steady-state scheduling constraints in terms of earliest 
start and latest finish times are defined. The TBIO is the 
earliest start of the sink and is determined to be 
EF(F) = ES(F)' + L(F) = 667 clock units. As a final 
check, the TBIO of 667 clock units should agree with 
equation (7), which finds the critical path. By 
equation (7), the path A ≺ B ≺ D ≺ E ≺ C ≺ F 
(containing all nodes and 1 initial token) has a total 
latency of (TCE - (1)(333)) = 667 clock units and is 
largest over all paths. Thus, the path A ≺ B ≺ D ≺ E 
≺ C ≺ F is critical. Table I lists the steady-state 
earliest start and latest finish times obtained by applying the 
ES and LF Algorithms to the DFG of figure 15. The 
reader is invited to construct a single graph play diagram 
using the ES times in table I. Likewise, a total graph play 
diagram can be constructed by using start times equal to 
ES modulo TBO. The SGP and TGP should agree with 
figures 14(c) and (a), respectively. 

A summary of other DFG attributes for the schedul- 
ing solution presented above is also provided in table I. 
The attributes listed include task instantiations, data 
memory requirements (buffers), and control edges for a 
TBO of 333 clock units, while utilizing three processors 
100 percent of the time. As noted, this solution is opti- 
mum in terms of TBO and processor utilization but is not 
optimum in terms of TBIO. Note also that even though 
an optimum solution does not exist for this example 
where TBO, TBIO, and R are fixed to optimum values, 
depending on the real-time constraints of the application, 
one could have designed a solution which made other 
trade-offs in performance. For example, another solution 
might maintain a minimum TBIO of 600 clock units 
while letting TBO increase above the lower bound of 
333 clock units. In general, depending on the availability 
of processors, the user has a two-dimensional region 
(TBO by TBIO) in which to make trade-offs. This region 
is referred to as an operating point plane in references 5 
and 17; TBO_lb and TBIO_lb define the minimum values 
for the two dimensions, respectively. 

5. Design Tool 

The algorithms and rules presented in the previous 
sections have been shown to be applicable to the analysis 


of the class of dataflow graphs described in section 1. A 
software tool is presented in this section which analyzes 
dataflow graphs and implements these design principles 
to aid the user in the implementation of a multiprocessing 
application. The software, referred to as the “Dataflow 
Design Tool,” or “Design Tool” for brevity, was written 
in Borland C++ 2 for Microsoft Windows. 3 The software 
can be hosted on an i386/486 personal computer or com- 
patible. The Design Tool takes input from a text file 
which specifies the topology and attributes of the DFG. 
A graph-entry tool has been developed to create the DFG 
text file. The various displays and features are shown to 
provide an automated and interactive design process 
which facilitates the selection of a multiprocessor data- 
flow solution. 

The process flow of the Design Tool upon loading a 
DFG, modifying the number of processors (R) or the 
iteration period (TBO), or adding control edges 
(new ≺) is shown in figure 16. After loading a DFG, 
the Design Tool will search the DFG for circuits in order 
to determine the minimum iteration period (T_0) using 
equation (6). The TBO will initially be set to the lower 
bound given in equation (8), where T_0 is zero if no 
circuits are present. The calculated R will initially be given 
by equation (10). Next, the MDFG is automatically 
constructed to account for initial tokens, if present, defined 
by the algorithm. All further analysis is based on the MDFG 
using the ES Algorithm and LF Algorithm in order 
to determine the TBIO, steady-state schedule ω, and 
buffer requirements (using the OE/OF Rules). Any 
changes to TBO, R, or ≺ result in a reapplication of 
the analysis algorithms and rules. 

² Version 3.1 by Borland International, Inc.
³ Version 3.1 by Microsoft Corporation.

Figure 16. The design process.

The same dataflow graph example shown in figure 1
is used for demonstration purposes. In this way, the tool
can be presented while verifying the theoretical results
obtained in the previous sections. The initial performance
analysis, without any graph modifications, in
terms of potential speedup is shown in figure 17 for up to
six processors. The performance display shows speedup
versus the number of processors. The display automatically
increases or decreases the abscissa each time the
number of processors R is changed. Figure 17 indicates
that maximum speedup performance is attainable with
four processors; additional processors will not result in
any further speedup. This leveling-off of performance is
attributable to the recurrence loop (circuit) within the
DFG. Without this circuit, the graph-theoretic speedup
would continue to increase linearly with the addition of
processors. Physically speaking, however, this linear
increase in speedup would ultimately break off due to
operating-system overhead, such as synchronization
costs and interprocessor communication.

Figure 17. Speedup display.

The Design Tool has a user-interface panel, referred
to as the “Metrics window” as shown in figure 18,
containing buttons and menus for displaying performance
bounds, setting TBO and R, or invoking the various
graphic displays. For example, the display shown in
figure 17 can be invoked by pressing the Performance
button. The time measurements shown in the
Design Tool windows are given in clock units so that the
resolution of the measurement can be user interpreted.
Upon analyzing the DFG, the Design Tool has determined
that TCE is 1000 clock units. The TBIO_lb is
defined by equation (7) based on the graph precedence
relations ≺ due only to the data dependencies (dataflow).
Due to the critical path A ≺ B ≺ F, TBIO_lb has
been determined to be 600 clock units. The TBIO will be
equal to TBIO_lb until additional control edges are added
with the tool, which may change the critical path. The
TBO_lb has been calculated to be 300 clock units based on
the critical circuit D ≺ E, and consequently, TBO is set
equal to this lower bound. The calculated R is determined
to be 4, which is the optimum number of processors for
repetitive, steady-state execution at the given TBO and
TBIO.
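
As a hedged illustration of how a critical circuit fixes the minimum iteration period, the sketch below takes T_0 to be the largest ratio of circuit latency to the number of initial tokens in that circuit. This is a common formulation and is assumed here to match equation (6), which is not reproduced in this section. The Circuit type and the function name are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical description of one circuit found in the DFG.
struct Circuit {
    int latency;  // sum of the latencies of the tasks in the circuit
    int tokens;   // initial tokens in the circuit (>= 1 for an executable graph)
};

// Assumed form of the circuit bound: T_0 = max over circuits of
// latency / tokens, and T_0 = 0 when the graph contains no circuits.
double minIterationPeriod(const std::vector<Circuit>& circuits)
{
    double t0 = 0.0;
    for (const Circuit& c : circuits)
        t0 = std::max(t0, static_cast<double>(c.latency) / c.tokens);
    return t0;
}
```

For the example graph, a single circuit of 300 clock units holding one token gives T_0 = 300 clock units, the TBO_lb reported by the Metrics window. Placing a second token in the same circuit halves the bound to 150 clock units.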

The SGP window shown in figure 18, created by the
Design Tool, shows the steady-state execution for a single
iteration. The SGP window can be compared with
that of figure 2. Slack time for task C is shown as an
unshaded bar. Although there is slack between the completion
of task E and the start of task F, the recurrence
relation E ≺ D at a TBO of 300 clock units as determined
by equation (4) has reduced the slack of task E to
zero. The window also displays the two TBO-width segments
with a vertical dashed line. Individually controlled
left and right cursors (solid vertical lines) are provided
for taking time measurements. Figure 18 shows the cursors
measuring the start and duration time of task C to be
100 clock units each (the “100” next to time at the bottom
of the display indicates the left-cursor time, whereas
the “100” in parentheses indicates the time between the
left and right cursors).

The TGP window shown in figure 19 displays the
steady-state schedule of tasks based on the current TBO
value of 300 clock units. The bars are shaded (with colors
or patterns) according to the relative data set numbers
shown above the bars. The TGP window has the same
measurements and viewing features as the SGP window,
including the time cursors. The time cursors are positioned
at the far left- and far right-hand sides to indicate
the TBO interval of 300 clock units as shown in parentheses.
The mouse cursor (shown as a hand) can be used
within the TGP (and SGP) window to point at a bar for
quick access of information as shown to the right of the
TGP window in figure 19 for node B. The information
window shows, among other things, that task B requires
two instantiations at a TBO of 300 clock units. This is
also apparent by observing that there are two overlapped
bars associated with task B for relative data sets 1 and 2.
The circuit-imposed zero slack time of task E is portrayed
in figure 19 by observing that, even though there
is slack between the completion of task E and the start of
task F, task D requires scheduling at the same time task E
completes. Note also that due to the E ≺ D initial
token, task D will execute on a data set injected one TBO
interval later than the data set produced by the completion
of task E.

Figure 20 shows how processor requirements and 
utilization can be shown graphically with a resource 
envelope diagram. The Design Tool provides a resource 
envelope window for both the SGP and TGP displays 
referred to as the “single resource envelope” (SRE) and 
“total resource envelope” (TRE), respectively. The TRE 
window for the TGP of figure 19 is shown in figure 20. 
Processor utilization for any time interval defined 
between the left and right time cursors is automatically 
calculated and displayed in a separate window. The pro- 
cessor utilization for the entire TBO interval of 300 clock 
units is shown in figure 20, indicating that a maximum of 
four processors are required with 83.3 percent utilization. 
The Utilization window also shows that, within the same
time interval, three out of the four processors are utilized
100 percent of the time and all four processors are 
utilized 33.3 percent of the time. The Computing 
Effort is the area under the envelope curve and is 
equal to TCE. 

A summary of the task system (T, ≺, L, M_0) is given
by a window referred to as the “graph summary window”
shown in figure 21 for the four-processor, 300-clock-unit
TBO performance level. The graph summary window
displays the values of L, ES, LF, slack, and instantiations
(INST) for each task in T along with the initial tokens
and queue sizes for each edge in ≺. The ES times shown
in figure 21 are associated with the task start times in
figure 18. It is apparent from this window that task C is
the only task with slack (measured to be 300 clock units)
as already indicated by figure 18. The graph summary
window also indicates the earlier observation that task B
requires two instantiations. The OE/OF column provides
the initial state of the detailed Petri net model of
figure 10 indicating the initial state M_0 and maximum
queue size, also shown in the QUEUE column. The
QUEUE column shows that two buffers are required for
the data associated with edges B ≺ F and C ≺ F.

5.1. Design Tool Use in Graph Optimization 

As discussed in the previous section, the example 
DFG has the potential of having a speedup performance 
of 3 with three processors as indicated by figure 15. 
However, the precedence relationship given by the
dataflow may not lend itself to this analysis in terms of
requiring three processors at a TBO of 334 clock units. 
Note that the optimum TBO for three processors is 
333 1/3 clock units. The Design Tool maintains the 
defined precision by rounding fractional times up to the 
next integer value. The graph source will ultimately be 
controlled to inject data at a rate 1/TBO determined by 
the Design Tool such that predictable performance can 
be attained and resource saturation avoided. The clock 
resolution used in the actual multiprocessing system is 
assumed to be the same as that defined for the tool, and 
therefore fractional times are rounded to the next clock 
unit for proper input-injection control. 
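
The rounding rule described above can be sketched with integer arithmetic. The sketch assumes that T_0 is already an integer number of clock units and that the iteration period is the larger of T_0 and TCE/R rounded up to the next clock unit. The function names are hypothetical.

```cpp
#include <algorithm>

// Hypothetical helpers (not Design Tool code).

// Iteration period in whole clock units: the larger of the circuit bound
// t0 and tce / processors, with the latter rounded up to the next integer.
int roundedTbo(int tce, int t0, int processors)
{
    int perProcessor = (tce + processors - 1) / processors;  // ceil(tce / R)
    return std::max(t0, perProcessor);
}

// Processor time left idle in one TBO interval, in clock units.
int idleClockUnits(int tce, int processors, int tbo)
{
    return processors * tbo - tce;
}
```

For TCE = 1000, T_0 = 300, and three processors the sketch returns 334 clock units, so rounding up from 333 1/3 leaves 2 clock units of idle processor time per iteration. For four processors the circuit bound of 300 clock units dominates.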

The inclusion of additional precedence constraints in
the form of control edges may reduce the processor
requirements of a DFG for a desired level of performance.
Because finding this optimum solution is NP-complete and
requires an exhaustive search, the Design Tool was developed
to aid the user in finding appropriate control edges when
needed and to make trade-offs when the optimum solution
cannot be found or does not exist (ref. 9). The design of a
solution for a particular TBO, TBIO, and R is ultimately
application dependent. That is, one application may dictate that
suboptimal graph latency (TBIO > TBIO_lb) may be
traded for maximum throughput (1/TBO_lb) while another
application may dictate the opposite. An application may
also specify a control/signal processing sampling period
(TBO) and a time lag between graph input g(t) and
graph output g(t - TBIO) that are greater than the lower
bounds determined from graph analysis, possibly making
it easier to find a scheduling solution.

Figure 21. Graph summary window of four-processor schedule shown in figure 19 for TBO = 300 clock units and TBIO = 600 clock units.

Use of the Design Tool for finding the optimum
three-processor solution is presented as an example since
the results can be compared with the theoretical results in
the previous section. First, the control edge E ≺ C,
which eliminates the needless parallelism for a single
iteration, can be added from the SGP window by selecting
the add Edge menu option as shown in figure 22. Any
control edge added within the SGP window will never be
initialized with tokens, resulting in only intra-iteration
precedence relationships. This is the desired effect with
the E ≺ C relationship. Upon selecting the add Edge
menu option, the SGP window will prompt the user for a
terminal node to be delayed by the control edge. Once
the terminal node (task) has been selected as shown in
figure 23, all nodes (tasks) independent of the terminal
node (task C) will be highlighted. These highlighted
nodes become the only candidates for selection as the
initial node. Selection of a dependent node is prohibited
because a circuit would be generated without any tokens;
this is a nonexecutable situation. The use of the information
window and time cursors may prove useful in making
use of slack time or delaying tasks such that any
increase in TBIO is minimized. Since task C, with a duration of
100 clock units, has 300 clock units of slack time and
task E finishes 100 clock units short of the start of task F,
one can easily see that task C can be delayed behind
task E without increasing TBIO. Selection of node E
causes the Design Tool to create the control edge
E ≺ C, reapply the analysis algorithms, and create the
expected SGP shown in figure 24.

Figure 22. Adding a control edge by using SGP window.
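
The reason a dependent node must be rejected can be sketched as a reachability test: a token-free control edge from a candidate initial node to the terminal node closes a token-free circuit whenever the candidate can already be reached from the terminal node. The sketch below covers only this circuit check, not the tool's highlighting logic, and all names in it are hypothetical.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical adjacency list: graph[u] holds every v joined to u by an
// edge u -> v that carries no initial tokens.
using Graph = std::map<std::string, std::vector<std::string>>;

// Collect every node reachable from 'start' along token-free edges.
std::set<std::string> reachableFrom(const Graph& graph, const std::string& start)
{
    std::set<std::string> seen;
    std::vector<std::string> stack{start};
    while (!stack.empty()) {
        std::string node = stack.back();
        stack.pop_back();
        auto it = graph.find(node);
        if (it == graph.end()) continue;  // node has no outgoing edges
        for (const std::string& next : it->second)
            if (seen.insert(next).second) stack.push_back(next);
    }
    return seen;
}

// True if a token-free control edge initial -> terminal would create a
// circuit without tokens, which is a nonexecutable situation.
bool closesTokenFreeCircuit(const Graph& graph,
                            const std::string& initial,
                            const std::string& terminal)
{
    if (initial == terminal) return true;
    return reachableFrom(graph, terminal).count(initial) > 0;
}
```

As a usage example, assume a six-node edge list chosen only to be consistent with the paths named in the text (A to B, C, and D; B, C, and E to F; D to E). The edge E ≺ C is then accepted, whereas a token-free edge from F to C is rejected because F already depends on C.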

The new periodic schedule as a result of the new
E ≺ C control edge is shown in figure 25(a) with the
processor utilization portrayed in the TRE window of
figure 25(b). At this point, a search for additional precedence
relationships is necessary that could effectively
move the computing effort requiring four processors to
fill in the underutilized idle time requiring only two
processors. As noted in section 4.4, a control edge creating
the precedence relationship B ≺ D provides a solution.
Addition of this control edge is done in the same way as
within the SGP window. However, unlike control edges
added within the SGP window, control edges added from
the TGP window are automatically initialized with
tokens as required to assure the desired steady-state
schedule (using the OE and OF Rules). Insertion of the
B ≺ D control edge from within the TGP window
results in the schedule and processor utilization as portrayed
in figures 26(a) and (b), respectively. It is apparent
from figure 26(a) with the two additional precedence
relationships, E ≺ C and B ≺ D, that an optimum
solution for three processors in terms of throughput has
been found. Note that 0.6 percent of idle time is attributable
to the rounding up of the ideal 333 1/3 clock units
TBO to 334 clock units for implementation purposes. As
mentioned, this solution is only optimal in terms of
throughput due to the 66 clock units delay of node D
(indicated by the left and right cursors in fig. 26(a)).
Since node D lies in the critical path, this delay results in
a TBIO of 666 clock units, as shown by the LF time of
task F in figure 27. The graph summary window in
figure 27 also displays the control edges added for
optimization, indicated by asterisks. Referring to the B ≺ D
control edge, the OF equal to 1, representing the presence
of one initial token, characterizes the inter-iteration
relationship that is required between B and D (one TBO
delay) to assure the desired schedule in figure 26(a), as
expected from the analysis in the previous section.

5.2. Case Study

Another example is given in this section for the purposes
of demonstrating the dependence that steady-state
behavior has on M_0 and TBO. The same six-node
graph is utilized except for a different initial marking M_0
and the additional precedence constraint between
nodes C and B as shown in figure 28. These differences
result in a new graph which is referred to as “DFG2.”

Figure 24. SGP window with control edge E ≺ C.

(b) TRE window.

Figure 25. Windows with control edge E ≺ C.

Figure 27. Optimized graph summary window of three-processor schedule shown in figure 26(a) for TBO = 334 clock units and TBIO = 666 clock units.

Figure 28. DFG2 with initial token on forward-directed edge.

Figure 29. Speedup potential of figure 28 DFG.


As a result of the additional token in the D ≺ E circuit,
the graph-theoretic speedup bound has increased;
therefore a speedup capability up to seven processors
(fig. 29) is provided. The initial token on the B ≺ F
edge affects the steady-state performance differently by
making TBIO and ω dependent on the iteration period,
TBO. For the purposes of illustrating this effect, the
scheduling solutions for two different iteration periods
are shown. The first example shown in figure 30, which
requires four processors for a TBO of 250 clock units,
results in a TBIO of 500 clock units (indicated in parentheses
using the SGP window cursors) which is less than
the graph schedule length of 600 clock units (indicated
next to the Schedule button). At this iteration period,
both tasks B and C have slack time. The slack time of
task B is shown to the left for the convenience of displaying
an interval equal to the schedule time and because
any delay in the completion of task B affects the execution
(start time of task F) for the next data packet
iteration.

The initial token on the B ≺ F edge also has the
potential of causing a transient condition such that
SGP_i ≠ SGP_j, which has an effect on the
steady-state performance. The second example, shown in
figure 31 for the smallest possible iteration period of
150 clock units for seven processors, results in a schedule
length equal to 600 clock units, which is still greater
than the TBIO of the graph; however, the critical path
has changed from the previous example. The Design
Tool has found the critical path to be
A ≺ C ≺ B ≺ F. Also, the initial token at this TBO
performance has caused task F to delay 50 clock units
(indicated by the SGP window cursors), as compared
with the case shown in figure 30, resulting in a TBIO
equal to 550 clock units. Because the calculated number of
processors (eq. (10)) equals the seven processors found
“sufficient” from the TGP window for the
optimum iteration period of 150 clock units, the
steady-state schedule shown in the TGP window is an
optimum solution for this example task system. The TGP
window also shows that the additional pipeline concurrency
allows the simultaneous execution of four data
packets within a TBO interval.

Figure 30. Dataflow schedule of figure 28 for four processors.

Figure 31. Dataflow schedule of figure 28 for seven processors.

Figure 32. Graph summary of figure 28 for seven processors.

Figure 32 shows the task system (T, ≺, L, M_0) summary
for a TBO of 150 clock units. The LF of task F with
no slack indicates that the TBIO is 550 clock units. Also,
tasks B and D require three and two instantiations,
respectively. As one might have expected, the queue size
(memory requirements) has increased from the lower
speedup example examined in the previous section
(figs. 20 and 28).

5.3. Algorithm Implementation Performance 

The ES Algorithm and the LF Algorithm can
be executed in polynomial time. For typical graphs, the
actual bound is somewhere between O(N^2) and O(N^3)
where equation (1) provides a conservative
graph-dependent bound. The C++ program code for the ES
Algorithm and the LF Algorithm is included in the
appendix. This section provides some performance data
on the execution of these algorithms within the Design
Tool.

Figure 33. Test graph.

The performance results of the ES Algorithm and
LF Algorithm within the Design Tool were obtained
for the graphs in figures 1, 14(d), and 33. The graph in
figure 33 was chosen as a good test when the graph is
tightly connected. Since the three graphs have six nodes
(N = 6) each, the worst-case complexity is given as N^3 or
216. In addition, the graph-dependent bound given by
equation (1) was determined for each graph for comparison
with the actual complexity. Executing steps 1 through 6
in the ES Algorithm and in the LF Algorithm is assumed
to take a constant time of K_1 and K_2, respectively.
The actual time complexity C
to complete the ES Algorithm is defined as the number
of times steps 1 through 6 are executed for a given
graph such that the total execution time is on the order of
K_1 C. Unlike equation (1) which assumes that all nodes
are traversed for every path, the ES Algorithm and the
LF Algorithm are more efficient in that each remembers
the previous nodes and path latency covered at any
given edge branch. Thus, the actual complexity C will be
less than the bound of equation (1) for most cases.

Table II. Design Tool Performance Results

                        ES Algorithm                  LF Algorithm
Graph in figure    Bound    C    Duration, μs    Bound    C    Duration, μs
1                    10     8        297           13    12        665
15(b)                15    10        390           20    18        920
34                   64    32        934           64    32       1214

The performance of the Design Tool was measured
on a Gateway 2000 486/33 EISA personal computer. The
computer operated with a 33-MHz clock speed and contained
16 MB of RAM. From the performance
results given in table II, the Bound (eq. (1)) and actual
complexity C for the graph in figure 33 without initial
tokens are the same for both algorithms. However,
since the backward-search LF Algorithm will encounter
more nodes than the forward-search ES Algorithm
when virtual sinks are present, the Bound and C for the
graphs in figures 1 and 14(d) with initial tokens differ
between the two algorithms. Note in all cases, however,
that C is less than the bound given by equation (1),
indicating the degree of efficiency in the algorithms.

6. Tool Applications and Future Research 

For years, digital signal processing (DSP) systems
have been used to realize digital filters, compute Fourier
transforms, execute data compression algorithms, and run a
vast number of other compute-intensive algorithms.
Today, both government and industry are finding that 
computational requirements, especially in real-time sys- 
tems, are becoming increasingly more challenging. As a 
result, many users are relying on multiprocessing solu- 
tions to meet the needs of these problems. To take advan- 
tage of multiprocessor architectures, novel methods are 
needed to facilitate the mapping of DSP applications 
onto multiple processors. Consequently, the DSP market 
has exploded with new and innovative DSP hardware 
and software architectures which provide mechanisms to 
efficiently exploit the parallelism inherent in many DSP 
applications. The dataflow paradigm has also been get- 
ting considerable attention in the areas of DSP and 
real-time systems. The commercial products that are
offered today utilize the dataflow paradigm as a graphical
programming language but do not incorporate dataflow
analyses in designing a multiprocessing solution.
Although there are many advantages to graphical pro- 
gramming, the full potential of the dataflow representa- 
tion is lost by not utilizing it analytically as well. In the 
absence of the analysis/design offered by this software 
tool, the commercial tool sets must rely on compile-time 
approximate solutions (heuristics) or run-time scheduling 
which often results in a trial-and-error design approach.
Not only can this tool lend itself to NASA aerospace 
DSP problems, but it is felt that this tool has high com- 
mercial potential as well. It could be readily incorporated 
into existing commercial DSP tool sets to determine a 
desirable multiprocessing solution at compile time. Other 
commercial uses of this tool include scheduling of DSP 
algorithms for real-time applications, including those 
found in aircraft, automotive, and industrial processes. 
The tool could also provide front-end scheduling con- 
straints for other commercial tools utilizing job- 
scheduling algorithms with the potential of finding better 
solutions. 

Planned extensions to the Design Tool include
incorporating heuristics to automate the selection of control
edges for optimal or near-optimal scheduling solutions.
Also, enhancements to the underlying model and
control edge heuristics are planned which will permit the
design of real-time multiprocessing applications for both
hard and soft deadlines (ref. 18). For hard real-time modeling,
the design would assume worst-case task latencies.
It has been observed that under such assumptions,
run-time execution may exhibit anomalous behavior
such as requiring more processors than indicated from
the worst-case scenario (ref. 19). However, such anomalies
can be avoided by inserting additional control edges
which impose stability criteria (ref. 19). Incorporating a
stability criteria algorithm similar to that of reference 19 would
allow the Design Tool not only to determine control
edges for increased performance, but also to guarantee
hard deadlines. In the context of DSP systems, the
Design Tool is capable of supporting only a single sampling
rate per graph. Many DSP algorithms require multiple
sampling rates, which is equivalent to graph nodes
consuming and depositing multiple tokens per firing as
opposed to only one token. Enhancements are planned to
the graph-analysis techniques which will support multiple
sampling rates within a DSP algorithm.

7. Concluding Remarks 

Graph-searching algorithms were defined and shown 
to effectively determine scheduling constraints on a task 
system represented by a dataflow graph. The dataflow 
graph was shown to determine performance bounds 
inherent in the task system, task instantiations, and buffer 
requirements for the data shared between tasks. Gantt 
charts were shown to be useful in depicting periodic task 
schedules, scheduling constraints, processor require- 
ments, and processor utilization based on the dataflow 
graph analysis. An equivalent modified dataflow graph 
was presented for the modeling of initial conditions in 
the graph. Such initial conditions were not only shown to 
complicate the calculation of task mobility but may also 
cause a transient condition. A timing relationship 
imposed on the modified graph was shown to separate
the steady-state behavior from the transient state. A software
implementation of the design algorithms and procedures
referred to as the “Design Tool” was presented and
shown to facilitate the selection of a graph-theoretic 
multiprocessing solution. The addition of artificial data 
dependencies (control edges) was shown to be a viable 
technique for improving scheduling performance by 
reducing the processor requirements. The selection of an 
optimum solution is based on user-selected criteria, that 
is, a particular TBO (time between outputs), TBIO (time 
between input and output), and R (number of required 
processors) or trade-offs when a solution which opti- 
mizes all three parameters cannot be found or may not 
exist. Optimizations with the use of the Design Tool by 
inserting control edges were demonstrated. 


NASA Langley Research Center
Hampton, VA 23681-0001
February 1, 1995

Appendix

Implementation of ES Algorithm and LF Algorithm 

The C++ program code which implements the ES and LF Algorithms is provided in this appendix. These functions
are private to the C++ Graph object which constructs and analyzes the dataflow graph. The SearchFwd function
is called by the findEarliestStart function to provide a depth-first search of the graph and determine the earliest
start times of all nodes. The SearchBkwd function effectively mirrors the SearchFwd function to provide a
depth-first search of the graph in the opposite direction. The SearchFwd and SearchBkwd functions are used by the
findLatestFinish function to determine the latest finish times of all nodes.
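
Before the original listing, the forward depth-first search can be illustrated with a compact, self-contained sketch. It is not the Design Tool code: it uses standard containers instead of the linked Node and Edge objects, handles only token-free edges of an acyclic graph, and prunes a branch when the node already holds an equal or later start time. The task latencies used in the usage note are assumptions chosen to be consistent with the times quoted in section 5 (TCE of 1000 clock units and a TBIO of 600 clock units); the split of 300 clock units between tasks D and E in particular is hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical task record for the sketch.
struct Task {
    int latency;                         // task latency in clock units
    std::vector<std::string> successors; // tasks fed by token-free edges
};
using TaskGraph = std::map<std::string, Task>;

// Depth-first forward search. 'arrival' is the latency of the path
// followed so far, i.e. a candidate earliest start time for 'name'.
void searchForward(const TaskGraph& graph, const std::string& name,
                   int arrival, std::map<std::string, int>& es)
{
    auto found = es.find(name);
    if (found != es.end() && found->second >= arrival)
        return;                          // an equal or later start is already recorded
    es[name] = arrival;
    const Task& task = graph.at(name);
    for (const std::string& next : task.successors)
        searchForward(graph, next, arrival + task.latency, es);
}

// Earliest start time of every task reachable from 'source'.
std::map<std::string, int> earliestStart(const TaskGraph& graph,
                                         const std::string& source)
{
    std::map<std::string, int> es;
    searchForward(graph, source, 0, es);
    return es;
}
```

With assumed latencies A = 100, B = 400, C = 100, D = 200, E = 100, and F = 100, and edges from A to B, C, and D, from B, C, and E to F, and from D to E, the sketch gives task F an earliest start of 500 clock units. Adding the latency of F gives 600 clock units, the TBIO lower bound reported for the example graph.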


//Declaration of node and edge types
//  DATA ...... data edges found in graph text file,
//  CONTROL ... control edges already present in graph text file,
//  NEW ....... control edges added by this tool,
//  VIRTUAL ... fictitious edges added to model inter-iteration dependencies, and
//  SPECIAL ... control edges added to source input for input injection control.

enum nodetype { NODE, SOURCE, SINK, VIRTUAL_SOURCE, VIRTUAL_SINK };
enum edgetype { DATA, CONTROL, NEW, VIRTUAL, SPECIAL };
typedef int ClockTicks;

struct Times { ClockTicks read,            //time to read input data
                          process,         //time to process data
                          write,           //time to write output data
                          earliest_start,  //earliest possible start time
                          latest_finish,   //latest finish time
                          fire; };         //time to fire node

class Node { char     name[SIZE];          //node name
             nodetype type;                //node type
             int      number,              //node #
                      graph,               //graph #
                      priority,            //task priority
                      instances,           //required instantiations
                      data_set;            //relative data set #
             Times    time;                //node times

  public:
             class Node *previous, *next;
             class Edge *input, *output;

             public/private methods...; };

class Edge { int      number,              //edge #
                      token_limit,         //queue size = initially empty + initially full
                      tokens;              //initial tokens = initially full queue slots
             edgetype type;                //edge type

  public:
             class Edge *previous, *next;
             class Node *initial, *terminal;
             class Edge *next_input, *next_output;

             public/private methods...; };

// SearchFwd( Edge*, ClockTicks )
// Implements a forward search of the graph starting from an Edge until
// a sink is found. Used by findEarliestStart and findLatestFinish.

void SearchFwd( Edge *edgeptr, ClockTicks latency )
{
   Node *nodeptr;   //terminal node of the current edge

   while ( edgeptr != NULL )
   {
      //branch: search the next output edge with the same path latency
      if ( edgeptr->next_output != NULL )
         SearchFwd( edgeptr->next_output, latency );

      nodeptr = edgeptr->terminal;

      // exclude SPECIAL edges, which terminate on sources
      if ( edgeptr->terminal->Type() == SOURCE )
         return;

      //keep the largest path latency as the earliest start time
      if ( latency > nodeptr->GetES() )
         nodeptr->SetES( latency );

      if ( nodeptr->Type() == NODE )
         latency += nodeptr->Latency();

      edgeptr = nodeptr->output;
   } //end while
   return;
} //end.


// findEarliestStart()
// Determine the earliest start times of all nodes by searching forward from
// all sources. Calls SearchFwd.

void findEarliestStart()
{
   Node *nodeptr;

   //initialize earliest start times to zero
   for ( nodeptr = first_node; nodeptr != NULL; nodeptr = nodeptr->next )
      nodeptr->SetES( 0 );

   nodeptr = first_node;
   while ( nodeptr != NULL )
   {
      //find and hold the place of a source
      while ( (nodeptr->Type() != SOURCE) &&
              (nodeptr->Type() != VIRTUAL_SOURCE) &&
              (nodeptr->next != NULL) )
         nodeptr = nodeptr->next;

      if ( (nodeptr->Type() == SOURCE) ||
           (nodeptr->Type() == VIRTUAL_SOURCE) )
         SearchFwd( nodeptr->output, 0 );

      nodeptr = nodeptr->next;
   } //end while
   return;
} //end.


// SearchBkwd( Edge*, ClockTicks )
// Implements a backward search of the graph from an Edge until a source is
// found. Used by findLatestFinish.

void SearchBkwd( Edge *edgeptr, ClockTicks latency )
{
   Node *nodeptr;   //initial node of the current edge

   while ( edgeptr != NULL )
   {
      //branch: search the next input edge with the same path latency
      if ( edgeptr->next_input != NULL )
         SearchBkwd( edgeptr->next_input, latency );

      nodeptr = edgeptr->initial;

      //determine latest finish time
      if ( latency < nodeptr->GetLF() )
         nodeptr->SetLF( latency );

      if ( nodeptr->Type() == NODE )
         latency -= nodeptr->Latency();

      if ( (nodeptr->Type() == SOURCE) ||
           (nodeptr->Type() == VIRTUAL_SOURCE) )
         return;

      edgeptr = nodeptr->input;
   } //end while
   return;
} //end.


// findLatestFinish()
// Determine the latest finish times of all nodes by searching backward from
// all sinks. For sinks created from edges with initial tokens, the latest
// finish rule states: LF(Sink) = ES(Nt) + d * TBO where Nt is the terminal
// node of original edge (sink now points to this node) and d is the number of
// initial tokens on the original edge. Calls SearchBkwd and SearchFwd.

void findLatestFinish()
{
   ClockTicks ES, LF, delta;
   Node *nodeptr, *succ_node;
   BOOL Done = FALSE;

   while ( !Done )
   {
      Done = TRUE;

      //initialize latest finish times to maximum storage value
      for ( nodeptr = first_node; nodeptr != NULL; nodeptr = nodeptr->next )
         nodeptr->SetLF( 0x7FFF );

      nodeptr = first_node;
      while ( nodeptr != NULL )
      {
         //find and hold the place of a sink
         while ( (nodeptr->Type() != SINK) &&
                 (nodeptr->Type() != VIRTUAL_SINK) &&
                 (nodeptr->next != NULL) )
            nodeptr = nodeptr->next;

         if ( (nodeptr->Type() == SINK) ||
              (nodeptr->Type() == VIRTUAL_SINK) )
         {
            //if sink is a result of initial tokens on an edge then
            // LF(sink) = ES(terminal node) + d*TBO
            if ( nodeptr->Type() == VIRTUAL_SINK )
            {
               // node receiving tokens from sink
               succ_node = getNode( nodeptr->Name() );
               LF = succ_node->GetES() + (nodeptr->input->Tokens() * TBO);

               // If delta = EF - LF > 0 then a timing violation has been
               // detected. Must increase ES(terminal node) by delta to satisfy
               // timing relationship. After doing so, propagate the updated
               // ES time to all descendents. Note: EF of initial node is
               // equal to ES of sink.
               if ( (delta = nodeptr->GetES() - LF) > 0 )
               {
                  Done = FALSE;
                  ES = succ_node->GetES() + delta;

                  //Delay the start time of node
                  succ_node->SetES( ES );

                  // Propagate the updated ES to all descendents
                  SearchFwd( succ_node->output, ES + succ_node->Latency() );
                  LF += delta;
               } //end if delta > 0
            } //end if virtual sink due to initial tokens
            else LF = nodeptr->GetES();

            SearchBkwd( nodeptr->input, LF );
         } //end if sink

         nodeptr = nodeptr->next;
      } //end while more paths
   } //end while not Done
   return;
} //end.